Abstract —The framework of cognitive wireless radio is expected to endow
wireless devices with a cognition-intelligence ability, with which they can
efficiently learn and respond to the dynamic wireless environment. In many
practical scenarios, the complexity of network dynamics makes it difficult
to determine the network evolution model in advance. As a result, the
wireless decision-making entities may face a black-box network control
problem, and model-based network management mechanisms will no longer be
applicable. In contrast, model-free learning has been considered an
efficient tool for designing control mechanisms when the model of the system
environment or the interaction between the decision-making entities is not
available as a-priori knowledge. With model-free learning, the
decision-making entities adapt their behaviors based on the reinforcement
from their interaction with the environment and are able to (implicitly)
build an understanding of the system through trial-and-error mechanisms.
Such characteristics of model-free learning are highly in accordance with
the requirement of cognition-based intelligence for devices in cognitive
wireless networks. Recently, model-free learning has been considered one key
implementation approach to adaptive, self-organized network control in
cognitive wireless networks. In this paper, we provide a comprehensive
survey on the applications of the state-of-the-art model-free learning
mechanisms in cognitive wireless networks. According to the system models
that those applications are based on, a systematic overview of the learning
algorithms in the domains of single-agent systems, multi-agent systems and
multi-player games is provided. Furthermore, the applications of model-free
learning to various problems in cognitive wireless networks are discussed,
with the focus on how the learning mechanisms help to provide solutions to
these problems and improve the network performance over the existing
model-based, non-adaptive methods. Finally, a broad spectrum of challenges
and open issues is discussed to offer a guideline for future research
directions.

Index Terms —Cognitive radio, heterogeneous networks, 
decision-making, reinforcement learning, game theory, model- 
free learning. 


I. Introduction 
A. Cognitive Radio Networks 

The original concept of Cognitive Radio (CR) was first proposed a little
over one decade ago [1]. In a broad sense, CR is defined as a prototypical
radio framework that adopts a radio-knowledge-representation language for
the software-defined radio devices to autonomously learn about the dynamics
of radio environments and adapt to changes of application/protocol
requirements. In recent years, Cognitive Radio Networks (CRNs) have been
widely recognized from a high-level perspective as an intelligent wireless
communication system. A device in a CRN is expected to be aware of its
surrounding environment and to use the methodology of
understanding-by-building to reconfigure the operational parameters in real
time, in order to achieve the optimal network performance [2], Q. In the
framework of CRNs, the following abilities are typically emphasized:

■ radio-environment awareness by sensing (cognition) in a 
time-varying radio environment; 

■ autonomous, adaptive reconfigurability by learning (intelligence);

■ cost-efficient and scalable network configuration. 

Many recent studies on CR technologies focus on radio-environment awareness
in order to enhance spectrum efficiency. This leads to the concept of
Dynamic Spectrum Access (DSA) networks a, which are characterized by a novel
PHY-MAC architecture (namely, primary users vs. secondary users) for
opportunistic spectrum access based on the detection of spectrum holes 0. It
is worth noting that, by emphasizing the network architecture of spectrum
sharing between the licensed/primary networks and the unlicensed/secondary
networks a, the term “DSA networks” is frequently considered interchangeable
with “CR networks” 0. The rationale behind such a consideration is that a
secondary network relies on spectrum cognition modules to make proper
decisions for seamless spectrum access without interfering with the primary
transmissions. For this category of works in the literature, “learning” is
mostly about the techniques of feature classification for primary signal
identification [6]. For an overview of the relevant techniques, the readers
may refer to the recent survey works in [7]-[9].

However, in order to achieve autonomous and cost-efficient network
configuration, the functionalities of self-organized, adaptive
reconfigurability also become fundamental for CRNs, since these
functionalities shape the mechanisms of network control and transmission
strategy acquisition. By emphasizing such an objective, the network
management mechanism is required to dynamically characterize the situation
of the decision-making entities in the network and accordingly infer the
proper transmission strategies. As the network management mechanisms in
conventional wireless networks are acquiring more and more levels of such a
cognition-intelligence ability, the border between a pure CRN (namely, a CRN
in the sense of DSA networks) and a conventional wireless network is
gradually diminishing cni, im. In recent years, the emerging networking
technologies (e.g., CRNs and self-organized networks [12], [13]) place more
emphasis on autonomous, adaptive reconfigurability. For these networks, the
concept of “intelligent network management” based on “cognition” can be
redefined as providing the functionalities of autonomous transmission policy
adaptation according to the radio-environment awareness capability of the CR
devices in numerous dimensions across the networking protocol stacks im. In
Figure 1, we provide an overview of the perceivable network states for
cognition and the cross-layer network functionalities for configuration in
cognitive wireless networks. Interested readers are referred to recent
surveys such as m, ca for more details about the CR applications in
different protocol layers.

Fig. 1. Relationship between the functionalities of cognition and
intelligence in a cognitive wireless network.

Considering the distributed nature of wireless networks, a good CR-based
framework of autonomous network configuration in time-varying environments
needs to address the following questions:

1) How to properly configure the transmission parameters with limited
ability of network modeling or environment observation?

2) How to coordinate the distributed transmitting entities (e.g., end users
and base stations) with limited resources for information exchange?

3) How to guarantee the network convergence under the condition of interest
conflicts among transmitting entities?

The need to address question 1) lies in the fact that in practical
scenarios, the abilities of environment perception may be limited on
different levels and/or for different devices. Therefore, the solution to
the problems raised by question 1) requires that a decision-making mechanism
be able to learn the transmission policies without explicitly knowing the
accurate mathematical model of the networks beforehand. Meanwhile, questions
2) and 3) are raised by the basic requirement of a self-organized,
distributed control system. Only by addressing questions 2) and 3) can the
network configuration process be efficient in both information acquisition
and policy computation. In summary, the key to answering questions 1), 2)
and 3) lies in the prospect of enabling the devices in CRNs to
distributively achieve their stable operation point under the condition of
information incompleteness/locality.

B. From Model-Based Network Management to Model-Free 
Strategy Learning 

When the designer of (distributed) network-controlling mechanisms has
complete and global information, the network control problem is frequently
addressed in model-based ways such as the optimization-decomposition-based
formulation/solution M-. With a model-based design methodology, the network
control algorithms are usually designed as a set of distributed computations
by the network entities (also known as decision-making agents in the domain
of control theory) to solve a global constrained optimization problem
through decomposition. Under such a framework, since the model of the
network dynamics is known in advance, there is no need for “learning”
anything about the network dynamics other than the time-varying network
parameters. However, in order to adopt such a design methodology, it is
necessary to assume that the set of network parameters (e.g., channel
information and channel availability probabilities) that determines the
target network utilities is fully available or perfectly known to all the CR
devices*. If an equilibrium [18] of a multi-entity network is expected
instead of global optimality, game theoretic approaches (e.g., for multiple
access problems [19] and network security problems [20]) can also be
adopted. Similar to the optimization-decomposition-based solutions, the game
theoretic approaches may still depend on a pre-known model of the network
dynamics. In this case, the mathematical tools of optimization theory can
also be used for the game theoretic approaches to obtain an equilibrium or
a locally optimal payoff, given that the strategies of the other network
entities are accessible.

However, due to the practical limitation of information
incompleteness/locality, directly applying the model-based solutions will
face difficulties, since a model of the network dynamics may not even be
available in advance, or in most cases its details may be inaccurate or not
instantaneously known to every device. Under the model-based framework, the
attempts to overcome the obstacles of information incompleteness/inaccuracy
are limited to a small scope by allowing more uncertainty/inaccuracy in the
a-priori network model. Examples of these attempts include the introduction
of robust control (e.g., variational inequality for spectrum sharing [21])
and fuzzy logic (e.g., fuzzy logic for call admission control US).
Nevertheless, these techniques still lack the strength to fully address the
three questions raised in Section I-A.

The difficulty of obtaining an accurate model in advance for dynamic network
control in practical scenarios can be illustrated by a multimedia
transmission task over a one-hop OFDM-based ad-hoc network (Figure 2). In
the network illustrated by Figure 2, the goal of the transmitter-receiver
pairs is to minimize the end-to-end distortion through joint power
allocation, channel code adaptation and source coding control over dynamic
channels. In practice, the obstacle to obtaining an appropriate device
behavior model first lies in the difficulty of constructing an accurate
end-to-end rate-distortion model at the source codec level, since modeling
the rate-distortion relationship for the MPEG-4 Scalable Video Coding (SVC)
mechanism is notoriously difficult ll2^. Moreover, one analytical model may
only apply to a certain category of video sources E3i. Meanwhile, the
stochastic evolution of the channel condition makes it difficult to predict
the state transitions of the channel-coding/retransmission mechanism, which
in turn results in uncertain error propagation at the video decoder of the
receiver [24]. Furthermore, when a distributed power control and subcarrier
allocation mechanism is adopted, it is impractical for a
transmitter-receiver pair to fully observe the transmission behaviors of the
other pairs of nodes, which makes optimal power-channel allocation difficult
with merely local channel observations. As a result, without knowing the
end-to-end distortion model, the channel evolution model and the information
of peer-node behaviors, the wireless nodes face a black-box optimization
problem with a limited level of coordination. In this situation, it is
difficult to apply the aforementioned model-based methods to video
transmission control.

* More details about the common assumptions for the model-based methods
can be found in □3

Fig. 2. Scalable video transmission over a one-hop OFDM-based ad-hoc
network.

In the scenarios of black-box network optimization/control with limited
signaling, it is highly desirable that the network control mechanisms do not
depend on the a-priori design of the devices’ behavior model. As a result,
the methods of controlling-by-learning without the need for an a-priori
network model, namely, the model-free decision-making approaches [25], [26],
are considered more appropriate, especially within the framework of CR
technologies. In the context of adaptive control, controlling-by-learning in
CRNs is usually described by the cognition-decision paradigm (Figure 3).
This paradigm describes the learning-based strategy-taking process of a
single device from a high-level perspective and interprets it as a cognition
cycle to present the information flow from environment cognition to the
final network control decision. In the paradigm, the model-based
decision-making process is replaced by the
observation-decision-action-learning loop. However, the paradigm itself does
not provide any detail on how much information about the system model should
be learned before a proper transmission strategy can be determined, or in
what way the information could be learned.

Fig. 3. Cognition cycle of a single wireless device [1].

When a network model is not known in advance, the strategy-learning process
can be further divided into two categories according to the way of using the
model knowledge obtained from the learning process: the “model-dependent”
methods and the “model-free” methods [25]. For model-dependent learning, an
arbitrary division exists between the learning phase and the decision phase,
and the goal of learning is to construct the network model first and then
use it to derive the network control strategies. By contrast, model-free
learning directly learns the network controller without explicitly learning
the network model in advance. Early research has pointed out that the
model-dependent learning methods are generally more computationally
intensive, while model-free learning trades a longer time to reach
controller convergence for reduced computational complexity [25]. Although
most of the existing research on strategy-learning methods in wireless
networks focuses on model-free learning due to the limited computational
resources in mobile devices, recent years have seen the border between the
two categories of strategy-learning methods keep diminishing ll2^.

C. A Brief Review of the Existing Survey Works on Learning 
in CRNs 

As indicated by our discussion in Sections I-A and I-B, the problem domain
of learning in cognitive wireless networks can be divided into two
categories: the problems of wireless environment cognition (namely, spectrum
sensing) [7]-[9] and the problems of network management (namely, strategy
learning). The solutions to the former problem sub-domain generally provide
the information that serves as the input to the strategy managers of the
latter problem sub-domain. In the literature, the existing surveys on the
network management problems are generally organized in accordance with the
protocol layers of the OSI/ISO model. These problems include the DSA-based
MAC protocol design in CRNs 0, lH, IZTl . IJS), routing protocol design in
CRNs m, and cross-layer network control problems in CRNs such as
self-organization ca, EOl and network security problems EqI, EH.

TABLE I

Summary of existing survey works on CR networking problems and model-free learning methods

Problem Domain of Cognitive Wireless Networks | Sub-domain of CR Networking Problems | Category of Corresponding Machine-learning Methods | Sub-category of Learning Methods
----------------------------------------------------------------------------------------------------
Wireless environment cognition | Spectrum sensing [7]-[9] | Supervised learning (pattern classification) [6], 13^ | N/A
Network management | DSA-based MAC protocol design m, (3, dll, dH | Unsupervised learning (model-free learning) dn, da, da da | Single-agent-based reinforcement learning [25]
 | Spectrum-aware routing [14] | | Multi-agent-based reinforcement learning [26], [34]
 | Self-organization [12], [30] | | Learning automata [35]
 | Network security [20], Dll | | Repeated-game-based learning [33], [36], [37]

With respect to different domains of networking problems, the pool of
potential machine-learning-based solutions can also be grouped into two
major categories. For the problems of spectrum sensing, surveys on the
applications of signal-classification-oriented learning methods can be found
in recent studies such as 0, Ea. For the network management problems, the
(model-free) strategy-learning-based solutions are generally identified as
belonging to the category of unsupervised learning 0. More specifically, the
techniques of controlling-by-learning in CRNs are usually characterized by
trial-and-error interactions with the dynamic wireless environment and are
thus also known as “reinforcement learning” (see our discussion in Section
II). In the past decade, researchers have paid significant attention to the
confluence of adaptive control, model-free learning and game theory Ei, ES.
In the domain of CRNs, it is believed that such a trend will lead to
promising solutions to the various network control/resource allocation
problems (e.g., [28], ED). In return, the development of recent network
technologies, such as self-organized networks and CRNs, increasingly demands
that more efficient learning mechanisms be implemented for an adaptive,
self-organized solution.

Most of the existing model-free learning methods for network control in CRNs
find their origin in the domain of control theory. In the literature,
important surveys on these model-free learning methods from the perspective
of control/game theory include ESI, ESI, ESl - ESl. In the context of
network control, existing survey works on the applications of strategy
learning usually focus on a certain sub-category of these learning methods.
In ES), EtI, comprehensive surveys on distributed learning mechanisms are
provided based on the framework of repeated games (see our discussion in
Section II-C). In 0, ESl, the surveys on model-free learning in CRNs place
the focus more directly on the Q-learning based methods (see our discussion
in Section II-A). Apart from the aforementioned works, other survey works on
strategy learning in wireless networks usually focus on a specific
sub-domain of applications such as wireless ad-hoc networks E9l and sensor
networks EOl. To assist the readers in obtaining an overview of the
development of model-free learning methods and their relationship with the
network management problems in CRNs, we summarize the aforementioned survey
works according to the domains they belong to in Table I.


TABLE II

Summary of acronyms for wireless networking terminologies

Terminologies | Abbreviations
----------------------------------------------------------------------
Base station | BS (Section mil IVl
Cognitive radio | CR (Section III mil lIVlIVi
Cognitive radio networks | CRNs (SectionlIIIIIIIIIV(rvl~
Dynamic channel assignment | DCA (Section IIIII
Dynamic spectrum access | DSA (Section Ilium
Key performance indicator | KPI (Section IVn
Network operator | NO (Section jV)
Primary user | PU (SectionlllllllVIlVI
Signal-to-interference-plus-noise-ratio | SINR (SectionllVIlVI
Signal-to-noise-ratio | SNR (Section mil IVI
Service provider | SP (Section 1 VI
Secondary user | SU (Section mil IIVIIVI
Heterogeneous networks | HETNET (Section llVt


D. Organization of the Paper 

This paper is devoted to providing a comprehensive survey on the current
development of model-free learning in the context of cognitive wireless
networks. In order to highlight the difference in the existing level of
information incompleteness/locality (from another perspective, the degree of
information coupling) for different learning mechanisms, we organize the
survey on the applications of learning in CRNs into three major categories:
(a) strategy learning based on single-agent systems, (b) strategy learning
based on loosely coupled multi-agent systems and (c) strategy learning in
the context of games. In Section II, the necessary background and the
preliminary concepts of learning in single-agent systems, distributed
multi-agent systems and games are provided. In Sections III-V, the recent
research on the applications of the three major categories of model-free
learning mechanisms in CRNs is reviewed according to the different system
models that the learning mechanisms are based on. In Section VI, some
important open issues for the application of model-free learning in CRNs are
outlined in order to provide insight into future research directions.
Finally, we summarize and conclude the paper in Section VII. In Table II and
Table III, we provide an acronym glossary of the terms used in the paper.

II. Background: Model-Free Learning in the 
Domains of Distributed Control and Game Theory 

Although the applications of model-free learning in wireless networks only
became commonplace in the early 2000s, the fundamental development of
model-free learning theory can be traced back much earlier, to the 1980s
[41], [42].
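TABLE IV

Sequential decision-making models in a nutshell

General Model | Specific Model | Tuple-Based Model Description | Agent-Strategy Coupling | Objective | Utility Measurement
----------------------------------------------------------------------------------------------------
Multi-agent Markov Decision Process (MDP) / Stochastic Game (SG) | Single-agent MDP | $\langle \mathcal{S}, \mathcal{A}, r, \Pr(s'|s,a) \rangle$ | N/A | Utility optimization | Accumulated utility
 | Multi-agent MDP | $\langle \mathcal{N}, \mathcal{S} = \times \mathcal{S}_n, \mathcal{A} = \times \mathcal{A}_n, \ldots, \Pr(s'|s,a) \rangle$ | Allowed | Utility optimization | Accumulated utility
 | Stochastic games | $\langle \mathcal{N}, \mathcal{S} = \times \mathcal{S}_n, \mathcal{A} = \times \mathcal{A}_n, \{r_n\}_{n \in \mathcal{N}}, \Pr(s'|s,a) \rangle$ | Always | Reaching equilibria | Accumulated utility
 | Repeated games | $\langle \mathcal{N}, \mathcal{A} = \times \mathcal{A}_n, \ldots$ | Always | Reaching equilibria | Accumulated utility
 | Static games | $\langle \mathcal{N}, \mathcal{A} = \times \mathcal{A}_n, \ldots$ | Always | Reaching equilibria | Instantaneous utility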

TABLE III

Summary of acronyms for model-free learning terminologies

Terminologies | Abbreviations
----------------------------------------------------------------------
Actor-critic learning | AC-learning (Section [y3
Actor-critic learning automata | ACLA (Section jn)
Correlated equilibrium | CE (Section InllV)
Correlated-Q learning | CE-Q learning (Section jV)
Constrained Markov decision process | CMDP (Section IIIII
COmbined fully DIstributed PAyoff and Strategy-Reinforcement Learning | CODIPAS-RL (Section jllj m
Derivative-action gradient play | DAGP (Section IV)
Dynamic programming | DP (Section jll)
Distributed reward and value function | DRV function IIVI
Distributed value function | DVF (Section llVt
Experience-weighted attraction learning | EWAL (Section IVII
Fictitious play | FP (Section III] jV)
Greedy policy searching in the limit of infinite exploration | GLIE (Sectninln}
Gradient play | GP (Section llin
Learning automata | LA (Section nil |Vt
Linear-reward-inaction algorithm | $L_{R-I}$ (Section nil IV>
Multi-agent Markov decision process | MAMDP (Section nil mil
Multi-agent system | MAS (SectionllllllVIlVIlVir
Markov decision process | MDP (Section IIIIIIIII
Nash equilibrium | NE (Section nil IVIIVn
Observation-orient-decision-action loop | OODA loop (Section [I)
Partially observable Markov decision process | POMDP (Section mil lIVl
Single-agent Markov decision process | SAMDP (Section nil lllh
State-action-reward-state-action | SARSA (Section IIII IIIII
Single-agent system | SAS (Section IIII llllt
Stochastic games | SGs (Section IIII IVt
Smoothed/Stochastic fictitious play | SFP (Section IV)
Simultaneous perturbation stochastic approximation | SPSA (Section IV)
Temporal difference learning | TD-learning (Sectionlllllllll ED
Transfer learning | TL (Section [VI)

In this section, we provide a necessary introduction to the general-purpose
learning methods that are developed in the domains of distributed control
and game theory. To assist our discussion about learning techniques applied
to cognitive wireless networks, we categorize the learning methods by the
degree of coupling among the decision-making agents with respect to
different system models. In what follows, we will briefly introduce the
general-purpose learning algorithms that are built upon the decision-making
models of single-agent systems, loosely coupled multi-agent systems and
game-based multi-agent systems. Before proceeding to more details of the
learning mechanisms, we first provide an overview of these decision-making
models in Table IV. The notations used in this section are listed in
Table V.

TABLE V

Summary of the main notations in Section II

Symbol | Meaning
----------------------------------------------------------------------
$t$ | Timing index
$a$ | A single action of the decision-making agent in a single-agent system
$a_{-n}$ | The joint action of the adversary agents for agent $n$ in a game
$\mathcal{A}_n$ | A finite set of actions for agent $n$ in a multi-agent system
$s$ | A single environment state of the agent in a single-agent system
$\mathcal{S}_n$ | A finite set of environment states for agent $n$ in a multi-agent system
$u_n(s_n, a_n)$ or $u_n$ | Instantaneous utility function of agent $n$ in a multi-agent system
$\Pr(\cdot)$ | State transition probability function
$\beta$ | The discount factor for a discounted-reward MDP
$\pi(s, a)$ or $\pi$ | The policy mapping function of an agent from a given state to an action
$\pi^*$ | An optimal or equilibrium policy
$\pi(s, a_{-n})$ or $\pi_{-n}$ | The joint policy of the adversary agents for agent $n$ in a game
$V_\beta^\pi(s)$ | The state-value function of a discounted-reward MDP from the starting state $s$
$Q_\beta^\pi(s, a)$ | The state-action value function of a discounted-reward MDP from taking action $a$ at the starting state $s$
$h^\pi(s)$ | The state-value function of an average-reward MDP from the starting state $s$
$V^\pi(s)$ | The bias utility of an average-reward MDP from taking policy $\pi$ at starting state $s$
$\alpha_t$, $\theta_t$ | The learning rates
$r$ | The (normalized) value of environment response used by learning automata algorithms


A. Single-Agent Strategy Learning 

In the context of distributed control and robotics, single-agent learning
has been considered the most fundamental class of strategy-learning methods.
Single-agent learning generally assumes that the learning agent has full
access to the state information that can be obtained about the system.
Frequently, the terms “reinforcement learning” and “model-free learning” are
(partially) used interchangeably to refer to the decision-making process of
a single agent. The agent learns to improve its performance by merely
observing the state changes in its operational environment and the utility
feedback that it receives after taking an action. In the recent surveys on
reinforcement-learning theory and its applications ia, Ea, such a
decision-learning process is described by an abstract model, namely, the
Observe-Orient-Decision-Action (OODA) loop I?). The OODA loop (Figure 4) can
be considered a generalized model of the cognition cycle in the context of
cognitive wireless networks (Figure 3), and it provides a generic
description of the information flow in the intelligent decision-making
process. However, it is the task of the specific reinforcement-learning
methods to define the rules of agent behavior that guide the interaction
with the to-be-explored environment. Since in most practical scenarios a
learning agent needs to deal with environment uncertainty, the Markov
Decision Process (MDP) [43] has become a prevalent tool in the literature
for abstracting the model of the agent-environment interaction. Based on the
MDP framework, various model-free learning methods such as Temporal
Difference (TD) learning [44] and learning automata [35] can be adopted to
define the behavior rules of an agent.

Fig. 4. The OODA loop, often known as the cognition cycle [1].

The standard (single-agent) MDP model is used to describe a stochastic
Single-Agent System (SAS). Mathematically, a single-agent MDP is defined as
follows:

Definition 1 (Single-agent MDP [26]). A single-agent MDP is defined as a
4-tuple: $\langle \mathcal{S}, \mathcal{A}, u, \Pr(s'|s,a) \rangle$, in which

• $\mathcal{S} = \{s_1, \ldots, s_{|\mathcal{S}|}\}$ is a finite set of environment states,

• $\mathcal{A} = \{a_1, \ldots, a_{|\mathcal{A}|}\}$ is a finite set of agent’s actions,

• $u : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the instantaneous utility function,

• $\Pr : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the state transition probability
function, which retains the Markovian property.
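To make Definition 1 concrete, the following minimal sketch stores the 4-tuple as lookup tables. The class name `TabularMDP`, its fields and the two-state "channel" example are illustrative assumptions introduced here; they are not taken from the surveyed works.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TabularMDP:
    """Finite single-agent MDP (S, A, u, Pr) stored as lookup tables (hypothetical container)."""
    n_states: int
    n_actions: int
    # utility[(s, a, s_next)] is the instantaneous utility; missing keys mean 0.
    utility: Dict[Tuple[int, int, int], float]
    # transition[(s, a)][s_next] is Pr(s_next | s, a).
    transition: Dict[Tuple[int, int], List[float]]

    def is_valid(self) -> bool:
        """Check that every Pr(. | s, a) is a probability distribution over S."""
        for s in range(self.n_states):
            for a in range(self.n_actions):
                row = self.transition.get((s, a))
                if row is None or len(row) != self.n_states:
                    return False
                if min(row) < 0.0 or abs(sum(row) - 1.0) > 1e-9:
                    return False
        return True

    def step(self, s: int, a: int, rng: random.Random) -> Tuple[int, float]:
        """Sample s_next ~ Pr(. | s, a) and return (s_next, utility)."""
        s_next = rng.choices(range(self.n_states), weights=self.transition[(s, a)])[0]
        return s_next, self.utility.get((s, a, s_next), 0.0)


def make_toy_channel_mdp() -> TabularMDP:
    """Assumed toy example: states 0=busy, 1=idle; actions 0=wait, 1=transmit."""
    transition = {(s, a): [0.3, 0.7] if s == 1 else [0.6, 0.4]
                  for s in range(2) for a in range(2)}
    # Transmitting on an idle channel yields utility 1; everything else yields 0.
    utility = {(1, 1, s_next): 1.0 for s_next in range(2)}
    return TabularMDP(2, 2, utility, transition)
```

The Markovian property shows up in `step`, which samples the next state from the current state-action pair only.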


In an MDP, the underlying environment is a stationary stochastic process,
and the consequences of the decisions can be probabilistic. The goal of a
decision-learning agent is to find the proper stationary policy,
$\pi = \Pr(a|s)$, that probabilistically maps state $s$ to action $a$ so
that the accumulated long-term utility of the agent is optimized. With
respect to different applications, the objectives of the MDPs may appear in
different forms. In this survey, we will mainly consider two types of
infinite-horizon objectives [25] as follows:

• the discounted-reward MDP with the discount factor $\beta \in [0, 1]$:

$$ V_\beta^\pi(s) = E_\pi\left( \sum_{t=0}^{\infty} \beta^t u_t(s_t, a) \,\middle|\, s_0 = s \right), \tag{1} $$

• the average-reward MDP:

$$ h^\pi(s) = \lim_{T \to \infty} \frac{1}{T} E_\pi\left( \sum_{t=0}^{T-1} u_t(s_t, a) \,\middle|\, s_0 = s \right). \tag{2} $$

Both types of MDPs can be represented in the form of the Bellman optimality
equation. For the discounted-reward MDP, the Bellman equation can be
represented either by the state-value function starting from state $s$ under
policy $\pi$:

$$ V_\beta^\pi(s) = E_\pi(u(s, a)) + \beta \sum_{s' \in \mathcal{S}} \Pr(s'|s, \pi) V_\beta^\pi(s'), \tag{3} $$

or by the state-action value function (Q-function) that starts from taking
action $a$ at state $s$ and follows policy $\pi$ thereafter:

$$ Q_\beta^\pi(s, a) = u(s, a) + \beta \sum_{s' \in \mathcal{S}} \Pr(s'|s, a) V_\beta^\pi(s'). \tag{4} $$
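When the transition probabilities are known, the fixed point of (3) can be computed by simple iteration, and (4) then follows directly. The sketch below is a model-based illustration under assumed conventions (nested lists `P[s][a][s_next]`, `u[s][a]` and `policy[s][a]`, and a discount factor strictly below 1); it serves as a reference point for the model-free methods discussed next.

```python
from typing import List


def evaluate_policy(P: List[List[List[float]]], u: List[List[float]],
                    policy: List[List[float]], beta: float,
                    tol: float = 1e-10, max_iter: int = 10000) -> List[float]:
    """Iterate equation (3) until the state values stop changing (requires beta < 1)."""
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    for _ in range(max_iter):
        V_new = []
        for s in range(n_states):
            value = 0.0
            for a in range(n_actions):
                expected_next = sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
                # Weight each action by its probability policy[s][a] = Pr(a | s).
                value += policy[s][a] * (u[s][a] + beta * expected_next)
            V_new.append(value)
        diff = max(abs(x - y) for x, y in zip(V, V_new))
        V = V_new
        if diff < tol:
            break
    return V


def q_from_v(P: List[List[List[float]]], u: List[List[float]],
             V: List[float], beta: float) -> List[List[float]]:
    """Equation (4): state-action values from the state values of the same policy."""
    n_states = len(V)
    return [[u[s][a] + beta * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
             for a in range(len(u[s]))] for s in range(n_states)]
```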


In order to express the average-reward MDP in the form of the Bellman
equation, the average adjusted sum of utility (i.e., bias) following policy
$\pi$ is introduced as follows:

$$ V^\pi(s) = \lim_{T \to \infty} E_\pi\left( \sum_{t=0}^{T-1} \left( u_t(s_t, a) - h^\pi(s) \right) \,\middle|\, s_0 = s \right), \tag{5} $$

with which the average-reward MDP can be expressed by the state-value
function^:

$$ V^\pi(s) + h^\pi(s) = E_\pi(u(s, a)) + \sum_{s' \in \mathcal{S}} \Pr(s'|s, \pi) V^\pi(s'). \tag{6} $$


With a variety of on-line learning methods that estimate the 
optimal Q-value or the bias value, a broad spectrum of value- 
iteration-based learning algorithms have been proposed ll2^ . 
m- Among them, the most widely used model-free learning 
algorithm is Q-learning ll44ll . which estimates the state-action 
value in (01) of a discounted MDP based on the time difference 
of the estimated values for the state-action value function: 


O't) + O:ti Ut{st, at) 

^ X (7) 

-|-/3niaxQt(st+i,a') - Qt{st,at)U 

where $\alpha_t \in (0,1]$ is the learning rate specifying the step by which
the current state-action value is adjusted toward the TD sample
$u(s_t,a_t) + \beta \max_{a'} Q_t(s_{t+1},a')$. Q-learning in (7) has been
proved to be able to converge to the true optimal value of
the state-action value function with a stationary deterministic
policy, given that $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$ and all
actions in all states are visited with a non-zero probability
m. The model-free property of Q-learning is reflected in 
the iterative approximation procedure for the Q-values, which 
does not require knowing the transition map Pr(s'|s, a) of the 
MDP in advance. 
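The model-free character of the update can be illustrated with a short tabular sketch. The toy environment, the uniform exploration and the constant step size below are assumptions chosen for illustration (a constant step is sufficient here only because the toy environment is deterministic); the learner calls `step` as a black-box simulator and never reads a transition map.

```python
import random


def step(s, a):
    """Hypothetical deterministic toy MDP: action 1 flips the state,
    action 0 keeps it; utility 1 is earned whenever the next state is 1."""
    s_next = 1 - s if a == 1 else s
    return s_next, (1.0 if s_next == 1 else 0.0)


def q_learning(steps=20000, beta=0.9, alpha=0.5, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0], [0.0, 0.0]]  # Q[s][a]
    s = 0
    for _ in range(steps):
        a = rng.randrange(2)          # uniform exploration visits every (s, a)
        s_next, u = step(s, a)
        td_sample = u + beta * max(Q[s_next])
        Q[s][a] += alpha * (td_sample - Q[s][a])   # Q-learning update
        s = s_next
    return Q


Q = q_learning()
greedy = [max(range(2), key=lambda a: Q[s][a]) for s in range(2)]
print(Q, greedy)  # greedy policy: move from state 0, stay in state 1
```

In this toy problem the optimal values are 10 for the greedy actions and 9 for the others, so the learned table can be checked by hand.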

The counterpart to Q-learning in the average-reward MDP
is known as R-learning [45]. In addition to learning the
state-action value of the bias expressed in (5), R-learning also needs
to learn the estimate of the average reward $h^\pi$. Therefore,
R-learning is performed by a two-time-scale learning process:

$$R_{t+1}(s_t,a_t) \leftarrow R_t(s_t,a_t) + \alpha_t \left( u(s_t,a_t) + \max_{a'} R_t(s_{t+1},a') - h_t - R_t(s_t,a_t) \right), \tag{8}$$

$$h_{t+1} \leftarrow h_t + \theta_t \left( u(s_t,a_t) + \max_{a'} R_t(s_{t+1},a') - h_t - \max_{a'} R_t(s_t,a') \right). \tag{9}$$

^Due to the space limit, the conditions for the existence of a value function
in the form of (6) are not presented here. The readers are referred to [45] for
the details.
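A single step of the two coupled R-learning updates can be sketched as follows. This is a minimal illustration: the table layout, the function name and the step-size names `alpha` and `theta` are assumptions, and both targets are computed from the table before it is modified.

```python
def r_learning_step(R, h, s, a, u, s_next, alpha, theta):
    """Apply one R-learning update; R is a dict of dicts R[s][a], h a float.

    Both targets are computed from the old table R_t before it is modified.
    """
    best_next = max(R[s_next].values())
    best_here = max(R[s].values())
    new_r = R[s][a] + alpha * (u + best_next - h - R[s][a])   # bias update
    new_h = h + theta * (u + best_next - h - best_here)       # average-reward update
    R[s][a] = new_r
    return R, new_h


R = {0: {"stay": 0.0, "move": 0.0}, 1: {"stay": 0.0, "move": 0.0}}
R, h = r_learning_step(R, h=0.0, s=0, a="move", u=1.0, s_next=1,
                       alpha=0.5, theta=0.1)
print(R[0]["move"], h)  # 0.5 0.1
```

Using a smaller step for the average-reward estimate than for the bias table is one way to obtain the two time scales.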

In contrast to the value-iteration-based learning algorithms
given in (7), (8) and (9), the decision-learning methods based
on the Learning Automata (LA) allow an agent to directly
learn the stationary randomized policy. Instead of updating the
action according to the myopic optimal Q-value in discounted-
reward MDP and bias-value in average-reward MDP, the LA
directly updates the probabilities of actions based on the
utility feedback [35]. Let the action probability vector at time
instance $t$ be $\pi(t) = (\pi_1(t), \ldots, \pi_{|\mathcal{A}|}(t))$, where $|\mathcal{A}|$ is the
size of the action set. Then an LA-based algorithm should be
able to achieve the following goal [35]:

$$\pi^* = \max_{\pi(t)} E[r(t) \mid \pi(t), s(t)], \tag{10}$$

where $r(t)$ is the value of environment response, and is usually
generated based on the instantaneous reward $u_t$ as a normalized
value (i.e., $r(t) \in \{0,1\}$). The general updating rule for LA
can be expressed as follows [46]:

7r,(f-f l)=7r,(f)-f (l-f(f)) 

a{t)=ai, 

( 11 ) 

where $f$ and $g$ are the penalty and reward functions, respectively.
Specifically, different forms of $f$ and $g$ lead to different
learning schemes. Among them, it has been proved that the
linear-reward-inaction (i.e., $L_{R-I}$) algorithm is guaranteed to
achieve the $\epsilon$-optimal policies [47]. In [45], the
automaton-updating procedure based on $L_{R-I}$ is adopted to learn the
optimal policy in the ergodic MDPs with average-reward
objective. In other works, the optimal policy of
the discounted-reward MDP is learned by adopting the $L_{R-I}$
algorithm for policy updating and the standard Q-learning
algorithm in (7) for Q-value estimation at the same time.
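The linear-reward-inaction scheme can be sketched in a few lines. The snippet below uses the textbook form of the update with a hypothetical step size `step`, and it assumes the convention that a normalized response of 1 is favorable and 0 is unfavorable; the survey's own sign convention for the response may differ.

```python
def lri_update(probs, chosen, reward, step=0.1):
    """Linear reward-inaction update of an action-probability vector.

    reward is the normalized environment response in [0, 1]; here 1 is taken
    to mean a favorable response (an assumed convention).
    """
    updated = []
    for i, p in enumerate(probs):
        if i == chosen:
            updated.append(p + step * reward * (1.0 - p))
        else:
            updated.append(p - step * reward * p)
    return updated


pi = [0.5, 0.5]
pi = lri_update(pi, chosen=0, reward=1.0)       # favorable: shift mass to action 0
print(pi)
pi_same = lri_update(pi, chosen=1, reward=0.0)  # unfavorable: inaction
print(pi_same == pi)  # True
```

The vector stays a valid probability distribution after every update, and an unfavorable response leaves it unchanged, which is the "inaction" part of the name.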

Although the two groups of learning mechanisms, namely, 
value-iteration-based learning (e.g., TD-based learning such 
as Q-learning and R-learning) and LA-based learning appear 
distinct from each other, both of them can be considered as 
special cases in the framework of Actor-Critic (AC) learning 
B9l . In the context of AC learning, the concepts of value 
function and policy are also known as “critic” and “actor”, 
respectively. Since Q-learning and R-learning only learn a 
state-action value function and there is no explicit function 
for the policy, the two learning algorithms are also known 
as the critic-only algorithms. On the contrary, without using 
any form of a stored value function, LA can be considered 
an actor-only algorithm. Extending from these two special 
cases, a generalized AC-based mechanism keeps track of both 
the state-value function and the policy evolution at the same 
time. In this sense, a generalized AC-based mechanism is 
also known as combined payoff and strategy learning Et). 
Specifically, if the state-action value of the MDP is learned 
following the TD-based methods and in the meanwhile the 

^For the details of Lr_j, please refer to Section IV-AS I 



Fig. 5. Schematic view of the generalized AC algorithm. 

learning agent’s policy is updated following the LA-based 
methods, the AC-learning mechanism is also known as Actor- 
Critic LA (ACLA) [50]. A typical rule for jointly updating
the estimate of the state-value and policy in ACLA can be 
found in m- Here it is worth noting that for both critic and 
actor updating, the learning mechanisms are not limited to the 
aforementioned two categories of algorithms. For example, an
on-policy learning algorithm, i.e., State-Action-Reward-State-Action
(SARSA), can be used to replace the Q-learning-based
critic-updating mechanism, and instead of the LA-like
actor-updating mechanism, policy gradient is widely used for actor
updating [49]. A schematic overview of the generalized AC
algorithm is given in Figure 5.

B. Strategy Learning in the Loosely Coupled Multi-Agent 
System 

A stochastic Multi-Agent System (MAS) can be defined
by extending the 4-tuple Single-Agent MDP (SAMDP) (Definition 1)
into a 5-tuple Multi-Agent MDP (MAMDP):
$(\mathcal{N}, \mathcal{S}, \mathcal{A}, \{u_n\}_{n\in\mathcal{N}}, \Pr(s'|s,a))$, in which $\mathcal{N}$ is the set of the
decision-making agents, $\mathcal{S} = \times_{n\in\mathcal{N}} \mathcal{S}_n$ is the Cartesian product of
the local state spaces of all the agents and $\mathcal{A} = \times_{n\in\mathcal{N}} \mathcal{A}_n$ is the
Cartesian product of the local action spaces of all the agents.
When considering the learning mechanism in an MAS, it is
natural to simply adopt the standard SAS-learning algorithms
by assuming that each agent is an independent learner with the
local utility function $u_n$. In doing so, the activities of

the other agents are treated as part of a stationary environment 
and the learning agents update their policy without considering 
their interactions with the other agents. This approach enjoys 
popularity especially within the studies in the cooperative 
decision-making domain [52], [53]. Its typical applications
can be found in modeling the hunter-prey systems [54] and
team coordination ll^ . just to mention a few. However, it 
is important to note that multi-agent learning based on SAS 
learning requires the joint learning process to be decomposed 
into local ones. Thus, individual-agent behaviors are relatively 
disjoint, and the agents are able to ignore the information 
raised by the interactions with each other. This is also the 
reason for us to call it a “loosely coupled multi-agent system”. 
Otherwise, with concurrent learning, all the individual agents 

About the difference between Q-learning and SARSA, the readers are
referred to ED for more details.











need to adapt their policies in the dynamic context of the 
other learners, in which case the basic assumption of stationary 
environment for the single-agent scenarios will no longer hold. 
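The independent-learner idea can be sketched on a single-state (repeated) team game. The payoff matrix, the purely random exploration and the sample-average step size below are assumptions for illustration; because the game has a single state, the discounted look-ahead term of the Q-learning rule is dropped. Each agent updates only a table over its own actions and treats the partner as part of the environment.

```python
import random

# Hypothetical common-payoff (team) game: both agents receive PAYOFF[a1][a2].
PAYOFF = [[1.0, 0.0],
          [0.0, 0.2]]


def independent_learners(rounds=4000, seed=1):
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]      # q[agent][own action]
    counts = [[0, 0], [0, 0]]
    for _ in range(rounds):
        actions = [rng.randrange(2), rng.randrange(2)]   # pure exploration
        u = PAYOFF[actions[0]][actions[1]]
        for agent in range(2):
            a = actions[agent]
            counts[agent][a] += 1
            # Each agent sees only its own action and the common payoff.
            q[agent][a] += (u - q[agent][a]) / counts[agent][a]
    return q


q = independent_learners()
greedy = [max(range(2), key=lambda a: q_i[a]) for q_i in q]
print(q, greedy)
```

Under this stationary (uniformly random) partner behavior, each local estimate approaches the payoff of an action averaged over the partner's actions, roughly 0.5 and 0.1 here. Once both agents start adapting their policies, that averaging assumption no longer holds.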

Although convergence of SAS-based learning is not guaran¬ 
teed in most of the practical MAS scenarios, attempts of gener¬ 
alizing the convergence condition for the SAS-based learning 
mechanism can still be found in the literature. By limiting 
the application scenarios to fully-cooperative MAMDPs (i.e., 
common-payoff MAMDPs), the convergence property of SAS- 
based learning with Greedy policy searching in the Limit of
Infinite Exploration (GLIE) for MAMDPs is discussed in [56]:

Proposition 1. For the multi-agent Q-learning schemes obeying
the individual updating rule in (7) in a cooperative MAS
system, assume that the following conditions are satisfied:

• the learning rate $\alpha_t$ decreases over time such that
$\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$,

• each agent samples each of its actions infinitely often,

• the probability of agent $i$ choosing action $a \in \mathcal{A}_i$ is
nonzero,

• the probability of taking a non-optimal action decreases
to 0 when $t \to \infty$ during the exploration stage,

let $\pi^*(t)$ be a random variable denoting the probability of
action-taking in a (deterministic) equilibrium strategy profile
being played at time $t$. Then for SAS-based learning, for any
$\zeta, \epsilon > 0$, there exists $T(\zeta,\epsilon)$ such that

$$\Pr(|\pi^*(t) - 1| < \epsilon) > 1 - \zeta, \quad \forall t > T(\zeta,\epsilon). \tag{12}$$

Although lacking a formal mathematical proof. Proposition 
[T] has been widely accepted in related studies 041 . OTl . A 
more general convergence condition for SAS-based learning 
in MAS scenarios is given by OSlI : 

Proposition 2. In an MAS environment, an agent following 
the updating rule in (7) will converge to the optimal response
Q-function with probability 1 as long as all the other agents 
converge in behaviors with probability 1. If the agent follows 
a GLIE policy and its best response policy is unique, it will 
also converge in behavior with probability 1. 

Propositions 1 and 2 provide theoretical support for the
convergence property of a number of SAS-based learning
algorithms that can be considered a variation of (7) (e.g.,
distributed Q-learning in cooperative MAMDPs [59] and policy
hill-climbing in two-agent MAMDPs [60]). Again, it is
worth pointing out that for most MAS scenarios (e.g., general-
sum stochastic games) convergence of SAS-based learning
is not guaranteed. Furthermore, even when convergence can
be reached, it usually takes a significant amount of time for
merely determining switching between one pair of actions. As
a result, most of the practical SAS-based learning mechanisms
are limited to special scenarios such as the fully-
cooperative MAS or two-agent MAS. In the framework of
the independent learning algorithm using standard Q-learning
[56], other SAS-based learning algorithms for MAS usually try
to eliminate the uncertainty caused by the actions of the other
agents while still retaining the distributivity of the decision¬ 
making process. One typical example can be found in 15^ . 
which projects the global Q-table of a deterministic MAMDP 


(namely, the state transition is deterministic in the MDP) using
centralized Q-learning with joint action $a = (a_1, \ldots, a_n)$,
$Q(s,a)$, to the local Q-table of agent $i$ with only local
action information $a_i$, $Q(s,a_i)$. Following the standard Q-
learning rule, the projection-based independent learning adopts 
an optimistic assumption that all the other agents will act 
optimally. However, the learning result of such a distributed 
algorithm is greedy with respect to the centralized Q-table with 
the joint action. Additionally, its convergence when extended 
to the scenarios of stochastic MAMDPs is not guaranteed 
since it cannot discern the influence of the behaviors of the 
other agents from that of the state dynamics. It is important to 
note that without explicit coordination, which is at the cost of 
losing the distributiveness of the decision-making process, all 
the independent-learning-based algorithms will suffer for the 
same reason as in the tightly coupled, MAS-based scenarios. 
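The optimistic projection described above can be sketched for a deterministic, single-state team game. The update used here (keep the maximum of the old estimate and the newly observed payoff) is one common form of such an optimistic rule and is an assumption; the payoff matrix and names are hypothetical, and the discounted look-ahead term is omitted because there is only one state.

```python
from itertools import product

# Hypothetical deterministic, single-state team game (common payoff).
PAYOFF = [[11, -30, 0],
          [-30, 7, 6],
          [0, 0, 5]]


def optimistic_local_tables(payoff):
    n_actions = len(payoff)
    q1 = [float("-inf")] * n_actions   # local table of agent 1, indexed by a_1
    q2 = [float("-inf")] * n_actions   # local table of agent 2, indexed by a_2
    for a1, a2 in product(range(n_actions), repeat=2):
        u = payoff[a1][a2]
        # Optimistic rule: never lower an estimate, so each entry ends up as
        # the best payoff seen for that local action over all partner actions.
        q1[a1] = max(q1[a1], u)
        q2[a2] = max(q2[a2], u)
    return q1, q2


q1, q2 = optimistic_local_tables(PAYOFF)
central_projection = [max(row) for row in PAYOFF]   # max over a_2 of Q(a_1, a_2)
print(q1, q2, q1 == central_projection)  # [11, 7, 5] [11, 7, 6] True
```

The local tables coincide with the projection of the joint table, which is why the rule relies on deterministic payoffs: with random payoffs, a single lucky sample would be kept forever.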

Despite all the limitations of independent learning, one 
important benefit of adopting the disjoint learning processes 
in the MAS is that it creates the opportunities of experience 
sharing among individual agents. In El, ED, the “implicit 
imitation” mechanism by the observer agents is proposed to 
incorporate the experience of the expert agents in the MAS. 
Under the framework of distributed, independent MDPs, it is 
frequently assumed that the learning agents are analogous to 
each other in terms of state space, state transition and action 
set ED- Then experience transferring can be implemented 
by modifying the estimated state-action value of the observer 
agent based on the expertise evaluation of the mentor agents 
and the weighted combination of their respective Q-values 
El . When experience transferring is considered beyond the 
framework of model-free learning and the model-based policy¬ 
learning mechanism is adopted, the observer agent can also 
implement the experience learning by maintaining the es¬ 
timation of the mentor’s transition map from observation, 
and incorporating the estimation into its own value-iteration 
process ED- 

C. Multi-Agent Strategy Learning in the Context of Games 

In most of the practical scenarios, the dynamics of the multi¬ 
agent MDP (e.g., the transition probabilities and the local 
payoff) is determined by the joint policy of all the agents. To 
facilitate distributed policy learning, the multi-agent MDP is 
usually viewed as a Stochastic Game (SG). Mathematically, an 
SG shares exactly the same 5-tuple structure as an MAMDP, 
$(\mathcal{N}, \mathcal{S}, \mathcal{A}, \{u_n\}_{n\in\mathcal{N}}, \Pr(s'|s,a))$. However, the goal of each
agent in the SG is to maximize its individual payoff ifTSll . 
Based on the definition of SGs, a repeated game can be
obtained as a 3-tuple, $(\mathcal{N}, \mathcal{A} = \times_{n\in\mathcal{N}} \mathcal{A}_n, \{u_n\}_{n\in\mathcal{N}})$, by fixing the
environment state as invariant while maintaining the objective
of each player as maximizing its individual discounted/average
payoff over the infinite time horizon. In the repeated game, the
system dynamics is reduced to only the mapping between the
action and the payoff: $u_n: \mathcal{A} \to \mathbb{R}$. Further, when the repeated
game is played only once, it is reduced to a static game. In
return, any single shot of an SG or a repeated game is a static 
game and is known as a single stage or one-shot game of the 
original game EH- 

One important reason for adopting the game theoretic
models lies in the requirements that decisions are to be
made in a distributed manner with the limited ability of both
information acquisition and action coordination. This may be
either due to the overwhelming dimension of the state-action
space as the number of agents grows, or due to the overhead
for information exchange among agents. In the game-based
decision-making model, the individual-rationality property of
the agents leads to the concept of the best response. In an
SG, the best response of agent $n$ is defined as the policy
$\{\pi_n = \Pr(s, a_n) : s \in \mathcal{S}\}$ such that the long-term payoff
under local policy $\pi_n$ is not worse than that under any other
local policy: $V_n(\pi_n, \pi_{-n}) \geq V_n(\pi'_n, \pi_{-n})$, given the joint
adversary policy $\pi_{-n}$. Here, $\pi_{-n}$ is the joint strategy of
the adversary agents except agent $n$, and $V_n$ can be either
the discounted long-term payoff or the average long-term
payoff. If, $\forall n \in \mathcal{N}$, the policy $\pi_n$ is a best response to the
joint strategy of the other agents, we say that the policy
profile $(\pi_1, \ldots, \pi_{|\mathcal{N}|})$ is a Nash Equilibrium (NE) lITSll. In the
context of games, the goal of policy learning now becomes
finding the policy updating rules for reaching a specific
equilibrium. Apart from the most commonly used solution
concept of NEs, a policy learning mechanism may resort to
other types of equilibria for the convenience such as ensuring
convergence or improving performance. In order to facilitate
our discussion on different learning algorithms, we provide the
formal definition of several equilibria in discounted-reward SG
$G = (\mathcal{N}, \mathcal{S}, \mathcal{A}, \{u_n\}_{n\in\mathcal{N}}, \Pr(s'|s,a))$ as follows:

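Definition 2 (Nash Equilibrium (NE)). In a game $G$, an NE
point is a tuple of strategies $(\pi_1^*, \ldots, \pi_{|\mathcal{N}|}^*)$ such that $\forall s \in \mathcal{S}$,
$\forall n \in \mathcal{N}$ and $\forall \pi_n \in \Pi_n$,

$$V_{\beta,n}(s, \pi_1^*, \ldots, \pi_{|\mathcal{N}|}^*) \geq V_{\beta,n}(s, \pi_1^*, \ldots, \pi_n, \ldots, \pi_{|\mathcal{N}|}^*),$$

in which $V_{\beta,n}(s, \pi_1^*, \ldots, \pi_{|\mathcal{N}|}^*)$ is given by (3) with a slight
abuse of notation.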

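Definition 3 (Correlated Equilibrium (CE)). In a game $G$, a
CE point is a joint strategy $\pi^* = (\pi_n^*, \pi_{-n}^*)$ such that $\forall n \in \mathcal{N}$,
$\forall s \in \mathcal{S}$ and $\forall a_n, a'_n \in \mathcal{A}_n$,

$$\sum_{a_{-n}:\,(a_n,a_{-n})\in\mathcal{A}} \pi^*(s,a_n,a_{-n})\, Q_{\beta,n}^{\pi^*}(s,a_n,a_{-n}) \geq \sum_{a_{-n}:\,(a_n,a_{-n})\in\mathcal{A}} \pi^*(s,a_n,a_{-n})\, Q_{\beta,n}^{\pi^*}(s,a'_n,a_{-n}),$$

in which $Q_{\beta,n}^{\pi^*}(s,a_n,a_{-n})$ is given by (4) with a slight abuse
of notation and $\pi^*(s,a_n,a_{-n}) = \pi^*(s,a_{-n}|a_n)\,\pi^*(s,a_n)$.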

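Definition 4 ($\epsilon$-Equilibrium). Let $\epsilon > 0$, the profile $\pi^* =
(\pi_n^*, \pi_{-n}^*)$ is an $\epsilon$-equilibrium of game $G$ if by following $\pi^*$
no player can improve its payoff by more than $\epsilon$ at any stage.

Specifically, given the condition of the NE (Definition 2),
$\pi^* = (\pi_n^*, \pi_{-n}^*)$ is an $\epsilon$-NE if $\forall s \in \mathcal{S}$, $\forall n \in \mathcal{N}$ and $\forall \pi_n \in \Pi_n$,

$$V_{\beta,n}(s, \pi_n^*, \pi_{-n}^*) \geq V_{\beta,n}(s, \pi_n, \pi_{-n}^*) - \epsilon.$$

Given the condition of CE (Definition 3), $\pi^* = (\pi_n^*, \pi_{-n}^*)$ is
an $\epsilon$-CE if $\forall s \in \mathcal{S}$, $\forall n \in \mathcal{N}$ and $\forall a_n, a'_n \in \mathcal{A}_n$,

$$\sum_{a_{-n}:\,(a_n,a_{-n})\in\mathcal{A}} \pi^*(s,a_n,a_{-n})\, Q_{\beta,n}^{\pi^*}(s,a_n,a_{-n}) \geq \sum_{a_{-n}:\,(a_n,a_{-n})\in\mathcal{A}} \pi^*(s,a_n,a_{-n})\, Q_{\beta,n}^{\pi^*}(s,a'_n,a_{-n}) - \epsilon.$$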

Based on Definitions 2-4, the conditions of equilibria for
repeated/static games can be obtained in a similar way. From
the perspective of strategy derivation, a CE can be considered
a generalized form of an NE since it does not require the
individual players' strategies to be independent of each other.
Although the adoption of a CE is recognized as being able to
provide a better performance than an NE, such a performance
improvement is usually at the cost of introducing an arbitrator
or coordinator into the game ESI. From the perspective of
convergence reaching, an $\epsilon$-equilibrium can be considered a
form of both NE and CE with relaxed condition. For learning
algorithm design in repeated games, the introduction of
$\epsilon$-equilibrium helps develop the learning mechanisms that guarantee
the convergence to near-equilibrium with a limit-inferior
bound. However, it is worth noting that for a general SG,
the existence of a stationary $\epsilon$-equilibrium is not guaranteed
beyond the case of two-player SGs [63].
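For a static two-player game, the NE and $\epsilon$-NE conditions reduce to a check on unilateral deviations, which can be sketched as below. The payoff matrices (a prisoner's dilemma), the function names and the numerical tolerance are assumptions for illustration; the stochastic-game versions of the definitions would replace the one-shot payoffs with the long-term values.

```python
def expected_payoff(U, x, y):
    """Expected payoff sum_{i,j} x[i] * y[j] * U[i][j]."""
    return sum(x[i] * y[j] * U[i][j]
               for i in range(len(x)) for j in range(len(y)))


def is_epsilon_nash(U1, U2, x, y, eps=0.0):
    """Check that no unilateral deviation gains more than eps.

    It suffices to test pure deviations, since a mixed deviation cannot
    do better than the best pure one.
    """
    v1 = expected_payoff(U1, x, y)
    v2 = expected_payoff(U2, x, y)
    n1, n2 = len(x), len(y)
    for i in range(n1):
        pure = [1.0 if k == i else 0.0 for k in range(n1)]
        if expected_payoff(U1, pure, y) > v1 + eps + 1e-12:
            return False
    for j in range(n2):
        pure = [1.0 if k == j else 0.0 for k in range(n2)]
        if expected_payoff(U2, x, pure) > v2 + eps + 1e-12:
            return False
    return True


# Hypothetical prisoner's dilemma; action 0 = cooperate, 1 = defect.
U1 = [[3.0, 0.0], [5.0, 1.0]]
U2 = [[3.0, 5.0], [0.0, 1.0]]
C, D = [1.0, 0.0], [0.0, 1.0]
print(is_epsilon_nash(U1, U2, D, D))           # True
print(is_epsilon_nash(U1, U2, C, C))           # False
print(is_epsilon_nash(U1, U2, C, C, eps=2.0))  # True
```

Mutual cooperation is not an NE here, but it qualifies as an $\epsilon$-equilibrium once $\epsilon$ covers the deviation gain of 2, which illustrates how the relaxed condition enlarges the set of admissible profiles.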

According to the Folk theorem ll3^, for every infinite-
horizon, n-player, discounted repeated/stochastic game with
a finite number of actions, the existence of a stationary
policy $\pi^*$ as a subgame-perfect NE ED is guaranteed. By
proving the existence of a subgame-perfect NE, the Folk
theorem implies that when compared with the static one-
shot game, policy learning may be able to obtain a better
payoff with the new NE in the repeated games. Such a benefit 
is also considered a major motivation for the engineers to 
adopt the game-based learning algorithms in the domain of 
distributed decision-making. However, the implementations of 
the learning algorithms heavily rely on the game structures 
and the forms of the equilibria, and may differ significantly. 
Within the past two decades, numerous methods have been 
proposed for strategy learning in games. In order to facilitate 
our survey on their applications in cognitive wireless networks, 
we categorize the model-free learning algorithms along the
following dimensions:

1) Value iteration vs. policy iteration: in SGs, most of 
the learning algorithms based on the state-action value 
estimation fall into the category of value-iteration based 
algorithms. These algorithms include minimax Q-learning 
ll64l . NSCP-learning ll^ Nash Q-learning Il66l . Nash 
R-learning [67] and CE-Q learning [68]. In contrast to
value-iteration-based learning, the policy-iteration-based 
learning algorithms directly update the action-probability 
vectors of each agent, using either the observation of the 
adversary agents’ action pattern or the payoff received 
from interaction with the environment. These algorithms 
include standard Fictitious Play (FP) 13^ . asynchronous
best response ll6^ . LA-based learning algorithms (e.g., 

^All the game-based learning methods to be discussed in the following 
sections originate from these algorithms, and in Section fiTTl more details will 
be provided for each of them. 

$L_{R-I}$ learning [47] and Bush-Mosteller learning [70]),
gradient-play-based better reply Gil and no-regret learn¬ 
ing fn\ . In the cases when both the strategy and the 
local expected payoff are to be learned, the AC-like, 
multiple-timescale learning algorithms GS provide an 
efficient strategy-learning approach (e.g., stochastic FP 
Gll) for the agents. Further, when the joint action or the 
payoff of the adversary agents is not directly observable, 
conjecture-variation-based learning Gl works as an al¬ 
ternative way of the aforementioned learning algorithms. 
In the literature, these joint policy-value-iteration mech¬ 
anisms for games are also known as the COmbined fully 
Distributed PAyoff and Strategy-Reinforcement Learning 
(CODIPAS-RL) mechanisms Gll- 

2) NE vs. other equilibria: most of the learning algorithms 
in 1) such as proposed in ll47l . Il64l - ll67l . Il69l - GT1 
aim at finding the NE of the repeated games/SGs. By 
contrast, the goal of CE-Q learning ll68l and some no¬ 
regret learning algorithms GH is to learn the CE in 
the SG and the repeated game, respectively. By relaxing 
the condition of an NE from the prohle of real actions 
to the profile of agent beliefs, conjecture-variation-based 
learning Gl converges to the conjecture equilibrium 
GSl . In most practical scenarios based on the framework 
of general repeated games, FP and stochastic FP only
guarantee that the e-equilibrium can be reached Gl- In 
the literature, $\epsilon$-equilibrium is sometimes known as the
Logit equilibrium when the Logit function is used for
strategy updating.

3) Noncooperative games, cooperative games and team 
games: technically, these three major categories cover 
most of the game-based models in the applications 
of distributed control. Provided that the noncooperative 
games satisfy certain properties (e.g., being supermod- 
ular/submodular or having a unique NE), all of 
the aforementioned learning algorithms in 1) and 2) 
may ensure to reach one of the equilibria in the game. 
For cooperative games, which usually feature
the process of bargaining or coalition formation among
agents, the Nash bargaining solution can be learned 
through EP GTl . A team game is defined as the game 
in which the agents share the common payoff function, 
thus considered as a fully cooperative case of the general 
SG-based games. Since every team game can be modeled 
as a potential game d, it is possible to apply best- 
response-based learning G21, stochastic EP Gbl or no¬ 
regret learning GSl to learn the NE of a repeated team 
game. In the case of team SGs, each agent can also be 
associated with one single learning automaton at one 
game state. Then by applying $L_{R-I}$ learning, a pure-strategy
NE is guaranteed to be reached GS).


^About the definition of a Logit function, please refer to Section IV-A3.

D. A Summary of Model-Free Learning Algorithms

Before proceeding to the next section, we provide a summary
of the learning mechanisms that have been introduced in
this section in Figure 6. In Figure 6, the learning mechanisms
are categorized according to the experience updating approach
(i.e., value iteration or policy iteration) that they apply. In
Table VI, we further summarize the characteristics of these
learning mechanisms in terms of stability property and the
system models (SAS, MAS and games) that they are built
upon. Figure 6 and Table VI together provide a quick sketch
of the algorithms that are to be surveyed with respect to their
applications in cognitive wireless networks. More details of the
characteristics of each learning mechanism will be provided
in the following sections.

[Figure 6, recoverable panel "Value Iteration": Q/R-Learning, SARSA, MAS Q/R-Learning, Minimax Q-Learning, Nash Q/R-Learning, CE-Q Learning, NSCP-Learning]

Fig. 6. A quick summary of the model-free learning algorithms.

TABLE VI
Brief characteristics of model-free learning mechanisms

Learning Mechanism                  | System Model                           | Stability Property
------------------------------------|----------------------------------------|-------------------------------
Q/R-learning                        | Single-agent MDP                       | Optimality learning
SARSA                               | Single-agent MDP                       | Optimality learning
MAS Q/R-learning                    | Multi-agent MDP                        | Optimality learning
Minimax Q-learning                  | Noncooperative SGs                     | NE learning
Nash Q/R-learning                   | Noncooperative SGs                     | NE learning
CE Q-learning                       | Noncooperative SGs                     | CE learning
NSCP-learning                       | Noncooperative SGs                     | NE learning
FP                                  | Noncooperative SGs/repeated games      | $\epsilon$-equilibrium learning
Gradient play                       | Noncooperative repeated games          | NE learning
Asynchronous best response          | Noncooperative/team repeated games     | NE learning
LA                                  | Noncooperative/team repeated games     | $\epsilon$-equilibrium learning
No-regret learning                  | Noncooperative/team repeated games/SGs | NE/CE learning
Actor-critic learning               | Single/multi-agent MDP                 | Optimality learning
Stochastic FP                       | Noncooperative repeated games          | NE learning
Conjecture-variation-based learning | Noncooperative SGs/repeated games      | $\epsilon$-equilibrium learning

III. Applications of Single-Agent-Based Learning in Cognitive Wireless Networks

Thanks to the property of self-organization, a model-free
learner is able to reduce the level of required a-priori knowledge
about the network model as well as the level of overhead
due to explicit information exchange. It is also possible for
the learner to adapt quickly to the changes of the network
environment. As a result, model-free learning is particularly
suitable for resource management and scheduling problems
that demand self-exploration and self-organization of the network
devices. Starting from this section, we will provide
a comprehensive survey on the applications of model-free
learning across different protocol layers in cognitive wireless
networks following the broad-sense definition of CRNs. In a
nutshell, the survey on the applications of model-free learning
is organized based on the categorization of the learning mechanisms
that is provided in Section II. According to the three
types of mathematical models for decision-making, Sections
III, IV and V are devoted to the applications of learning
algorithms based on single-agent systems, loosely coupled
multi-agent systems and game-based multi-agent systems,
respectively. The notations used in this section are summarized
in Table VII.

TABLE VII
Summary of the main notations in Section III-A

Symbol               | Meaning
---------------------|-------------------------------------------------------------------------------
$\mathcal{O}$        | A finite set of observation states in a partially observable Markov decision process
$o$                  | A single observation state in a partially observable Markov decision process
$w(s)$               | A weighting function to map a set of states $s$ to a new state for state abstraction
$c_t(s, a)$ or $c$   | Instantaneous cost function of a constrained MDP
$\lambda$            | Lagrange multiplier

A. Applications of Learning in Single-Agent Systems 

The early attempts in applying learning algorithms to wire¬ 
less networking problems appeared even before the concept 
of cognitive radios was proposed. Generally, the a-priori 
knowledge of the environment evolution dynamics (e.g., the 
transition probabilities of the MDPs) is not required by 
the MDP-based, value-iteration learning schemes. Thus, the 
schemes are widely applied to the problems in the time- 
varying dynamics of the wireless environment that cannot be 
perfectly sensed. These problems include dynamic packet routing
[80], Dynamic Channel Assignment (DCA) [81], [82] and
joint radio resource management for multi-rate transmission
control in WCDMA networks [83], just to mention a few. The
strategy-learning schemes in these studies feature a
single/centralized agent, and are usually based on the standard
Q-learning algorithm given in (7). In early studies, the learning
schemes are built upon the simplified system models. Thus, 
the issues such as the convergence conditions of the learning 
schemes are still not the focus of the discussion. As a result, 
the existence of Markovian property is simply assumed in 
most of these works ED-lSl. Also, in order to reduce the 
complexity of the system model, the original MDPs modeling 
the network dynamics are usually transformed into new MDPs 
with reduced state-action space using state abstraction [84] or
Q-table projection methods. However, the equivalence between
the original MDPs and the transformed MDPs is generally
not guaranteed (see the example of [83]). In most of these
works (e.g., il, ED), the learning rules are designed in a 
heuristic manner. Sometimes the standard Q-learning schemes 
are modified by introducing the neural networks in order to 
represent the table of the state-action values and approximate 
the Q-value-updating function EOl . E2l . E3l . With these 


simplifications, the convergence to an optimal strategy of the 
learning schemes in these studies is also not guaranteed. 

Among different approaches for simplifying the MDP-based 
model of the network-control process, state abstraction [84]
becomes a necessary way of trading off optimality for the 
efficiency of the single-agent-based learning mechanisms. The 
necessity of state-action-space reduction lies in the need for 
computational tractability of the learning schemes in the case 
of state-action-space explosion. This is especially necessary 
when a single agent is learning the strategy from a large set 
of candidate actions in a system with a huge number of states. 
In the context of networking problems, state abstraction maps 
an original network-control model based on one MDP into 
a new MDP with a smaller state-action set. Mathematically, 
state abstraction in MDPs can be defined as follows: 


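Definition 5 (State abstraction ED). For two MDPs $M =
(\mathcal{S}, \mathcal{A}, u, \Pr(s'|s,a))$ and $\bar{M} = (\bar{\mathcal{S}}, \mathcal{A}, \bar{u}, \Pr(\bar{s}'|\bar{s},a))$, $\phi:
\mathcal{S} \to \bar{\mathcal{S}}$ is such a mapping that $\{\phi^{-1}(\bar{s}) \mid \bar{s} \in \bar{\mathcal{S}}\}$ partitions
the state space $\mathcal{S}$. Define a weighting function $w: \mathcal{S} \to [0,1]$,
where $\forall \bar{s} \in \bar{\mathcal{S}}$, $\sum_{s \in \phi^{-1}(\bar{s})} w(s) = 1$. $\bar{M}$ is an abstracted MDP
of $M$, if the following conditions are satisfied:

$$\bar{u}(\bar{s},a) = \sum_{s \in \phi^{-1}(\bar{s})} w(s)\, u(s,a), \tag{13}$$

and

$$\Pr(\bar{s}'|\bar{s},a) = \sum_{s' \in \phi^{-1}(\bar{s}')} \sum_{s \in \phi^{-1}(\bar{s})} w(s) \Pr(s'|s,a). \tag{14}$$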
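The two abstraction conditions (13) and (14) can be sketched as a direct computation when the original model is known. The three-state MDP, the mapping `phi` and the weights `w` below are hypothetical values chosen for illustration, and a single fixed action is assumed to keep the tables two-dimensional.

```python
def abstract_mdp(u, P, phi, w, n_abstract):
    """Build abstract utilities and transitions for one fixed action.

    u[s]      : utility u(s, a) in the original MDP
    P[s][s2]  : Pr(s2 | s, a) in the original MDP
    phi[s]    : index of the abstract state that s is mapped to
    w[s]      : weight of s inside its block (weights sum to 1 per block)
    """
    n = len(u)
    u_bar = [0.0] * n_abstract
    P_bar = [[0.0] * n_abstract for _ in range(n_abstract)]
    for s in range(n):
        u_bar[phi[s]] += w[s] * u[s]                        # as in (13)
        for s2 in range(n):
            P_bar[phi[s]][phi[s2]] += w[s] * P[s][s2]       # as in (14)
    return u_bar, P_bar


# Hypothetical 3-state MDP; states 0 and 1 are merged into abstract state 0.
u = [1.0, 3.0, 0.0]
P = [[0.2, 0.3, 0.5],
     [0.0, 0.4, 0.6],
     [1.0, 0.0, 0.0]]
phi = [0, 0, 1]
w = [0.5, 0.5, 1.0]
u_bar, P_bar = abstract_mdp(u, P, phi, w, n_abstract=2)
print(u_bar)   # approximately [2.0, 0.0]
print(P_bar)   # approximately [[0.45, 0.55], [1.0, 0.0]]
```

Because the weights sum to one within each block, every row of the abstract transition table again sums to one. The computation needs the original transition probabilities, which a model-free learner does not have.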

However, the state-abstraction method generally requires
the state transition in the new MDP with reduced complexity
to be well-defined. Namely, the linear-combination-
based mapping in (13) and (14) needs to be established and
the condition $\sum_{\bar{s}'} \Pr(\bar{s}'|\bar{s}, a) = 1$ needs to be satisfied. Since
with model-free learning, the transition models are generally
not known, it will be practically impossible to obtain an
accurate model of the reduced MDP. In order to address such
an issue, approximate abstraction is proposed in ESI, ESI. 
In ESI, ESI, an on-policy reinforcement learning method, 
SARSA, is applied to the DCA problem in a multi-cell, multi¬ 
channel network with the consideration of handoffs. In the 
considered cellular network, N cells provide M channels to 
mobile stations, thus forming an $N \times (M+1) \times M$ state-action
set. The arbitrary state-aggregation method proposed in ESI, 
ESI aggregates the rarely encountered states by reducing the 
size of the channel state space to a fraction of the total number 
of the channels. The state variable representing the number 
of currently allocated channels is also excluded, which leads 
to a 98% reduction from the original state-action space. A 
more complicated state-action-space abstraction method can 
be found in ESI . It adopts the feature extraction method 
and maps the original state vector based on four dimensions, 
namely, the mean and variance of the interference from the 
existing connections, the transmission type and the required 
transmission rate, into a vector of the resultant interference 
profile. The feature extraction method is further adopted in 
stochastic-game-based modeling for strategy learning in CRNs 
E3, EH- In ED, El, the central spectrum moderator 

Fig. 7. Real-time video streaming process (adapted from ED).



Fig. 8. The operation and message exchange in the layered MDP (adapted 
from dU). 


allocates the transmission opportunities to the CRs through an 
iterative, second-price auction (see IISl for the definition of 
auctions), whose dynamics is jointly determined by the Signal- 
to-Noise-Ratio (SNR) of the channels and the buffer states 
of all the CRs. In [87], [88], multi-stage bidding is adopted.
Since for each CR, the value of tax to be paid for using
the channels is based on the inconvenience it causes to the
other CRs, the individual CRs use their local tax announced
by the central spectrum moderator to classify the channel-
buffer states that the other (adversary) CRs are in. Therefore,
individual CRs only need to exchange the pricing information
with the central spectrum moderator, and no extra information
exchange between the CRs is required. In these works, the
feature extraction method not only achieves the goal of
state abstraction, but also helps avoid the explicit information
exchange between individual CRs.

With the development of MDP-based modeling in different
protocol layers of the wireless networks (see examples in the
MAC layer, link layer and application layer [92]), the
SAS-based learning mechanisms in the cognitive wireless
networks also gain more capabilities in addressing the radio
resource management problems. In [89], the problem of real-time
video transmission over a single-hop, slow-varying flat
fading channel is formulated as a systematic layered MDP
(see Figure 7 and Figure 8 for a schematic view of the
system and the corresponding layered MDP model). With the
proposed problem formulation, the discrete system state is
composed of three components, i.e., the SNR as the channel
state in the PHY layer, the transmission opportunity as the
state of the MAC layer and the amount of both the incoming
traffic and the buffered packets as the state of the application
layer (see Figure 8). The evolution of the joint state
$(s_{APP}, s_{PHY})$ is modeled as a Markov chain controlled by
the joint action $(a_{APP}, a_{MAC}, a_{PHY})$, in which $a_{MAC}$ is
composed of two internal actions $b_{PHY}$ and $b_{MAC}$. The joint
action is determined by the power allocation, the channel
resource payment made to the spectrum moderator and the
packet scheduling algorithm. The cross-layer management of
packet transmission can be formulated as a layered MDP because
the Dynamic Programming (DP) based expression of the Bellman
optimality equation for the state value can be decomposed into
a two-loop DP-based optimization. In the two-loop optimization,
it is assumed that both layers have access to the global state
in each time slot. The inner loop (i.e., the application-layer
optimization) only needs to know the joint MAC-application
action and the reported state value of the PHY layer for policy
updating, while the outer loop (i.e., the PHY-layer
optimization) only needs to know the PHY-layer action
information and the reported state value from the application
layer for policy updating. The layered Q-learning ll^ can be
applied to learn the optimal strategy for transmission, with the
standard Q-value updating rule in (7) modified in each layer by
incorporating the estimated Q-value from the other layer into
the estimation of the local Q-values.

Apart from lacking a-priori knowledge about the statistics
of the underlying Markov process, the decision-making
entity in the network may frequently face constraints on the
available resources. To tackle these constrained radio resource
allocation/scheduling problems, the unconstrained MDP models
are extended to Constrained MDPs (CMDPs), based on which
modified reinforcement learning algorithms are also proposed
[94]-[99]. Mathematically, a CMDP is defined by expanding the
4-tuple MDP model (Definition 1) to a 5-tuple,
$(\mathcal{S}, \mathcal{A}, u, c, \Pr(s'|s,a))$, with the additional
cost/constraint element $c$ [100]. Taking the average-reward
CMDP as an example, a generic CMDP optimization problem
can be stated as follows:


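\[
\begin{aligned}
\max_{\pi}\quad & h^{\pi}(s) = \limsup_{T\to\infty} \frac{1}{T}\, E_{\pi}\left\{ \sum_{t=0}^{T} u_t(s_t, a_t) \right\}, \\
\text{s.t.}\quad & C^{\pi}(s) = \limsup_{T\to\infty} \frac{1}{T}\, E_{\pi}\left\{ \sum_{t=0}^{T} c_t(s_t, a_t) \right\} \le C.
\end{aligned}
\tag{15}
\]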


According to Theorem 12.7 of [100], we have the following
theorem for the average-reward CMDP:


Theorem 1. If the underlying Markov chain of the CMDP,
$(\mathcal{S}, \mathcal{A}, u, c, \Pr(s'|s,a))$, is unichain and the
sequence of the immediate cost $c_t$ is bounded below and
satisfies the following growth condition:

for $c: \mathcal{K} \to \mathbb{R}$ there exists a sequence of increasing
compact subsets $\mathcal{K}_i$ of $\mathcal{K}$ such that
$\cup_i \mathcal{K}_i = \mathcal{K}$ and
$\lim_{i\to\infty} \inf\{c(\kappa);\ \kappa \notin \mathcal{K}_i\} = \infty$,

then there exists an optimal Lagrange multiplier $\lambda^*$ such
that the optimal solution of the CMDP is equivalent to the
optimal solution of the unconstrained MDP,
$(\mathcal{S}, \mathcal{A}, g = u - \lambda^* c, \Pr(s'|s,a))$.

According to Theorem 1, non-structured learning schemes
for the unconstrained MDP based on the Lagrangian dual
function can be developed for solving the resource
management/scheduling problems in the form of both
R-learning iH, HD and Q-learning (H, ED, ifToTl,
depending on the form of the reward/cost of the CMDP.
Apart from the solution based on primal-dual equivalence, it is
also possible to develop constrained learning algorithms by
exploiting the structure of the specific problems. Such
structure is characterized by the convexity of the objective and
constraint functions in the original CMDP, or by the modularity
of the objective or the constraint functions [98], [102]. When
a certain structural property of the network control problem
is satisfied (specifically, when both the instantaneous payoff
and the constraint cost are multi-modular), the constrained
structured-learning algorithm can be applied in the form of
primal projection or submodular parameterization [102].
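
The primal-dual idea behind Theorem 1 can be illustrated with a minimal sketch. It assumes a tabular, discounted Q-learning step (used here only as a simplification of the average-reward setting) applied to the scalarized reward $g = u - \lambda c$, together with a projected subgradient step on the multiplier. The function names, step sizes and the toy Q-table are hypothetical and are not taken from the surveyed works.

```python
def lagrangian_reward(u, c, lam):
    """Scalarized reward g = u - lam * c of the unconstrained MDP."""
    return u - lam * c


def q_update(q, s, a, g, s_next, alpha=0.1, beta=0.9):
    """One tabular Q-learning step driven by the scalarized reward g.

    q is a dict of dicts: q[state][action] -> value (assumed layout).
    """
    best_next = max(q[s_next].values())
    q[s][a] = (1 - alpha) * q[s][a] + alpha * (g + beta * best_next)
    return q[s][a]


def update_multiplier(lam, observed_cost, cost_bound, step=0.05):
    """Projected subgradient step: lam grows while the cost bound is violated."""
    return max(0.0, lam + step * (observed_cost - cost_bound))


# Toy usage: two states, two actions, all numbers invented for illustration.
q_table = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}}
g = lagrangian_reward(u=1.0, c=2.0, lam=0.5)
q_update(q_table, s=0, a=1, g=g, s_next=1)
```

In such a scheme the multiplier is typically adapted on a slower time scale than the Q-values, so that the inner learner sees an almost stationary reward.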

In addition to not knowing the environment evolution dynamics
and being limited by the resource constraints, the
learning agents in a wireless network may also be unable to
acquire complete state information. This is a common issue in
scenarios such as DSA networks, in which the secondary devices
cannot perform full-spectrum sensing due to the limited number
of antennas [109]. The common approach to handling such a
problem is to model the radio resource management problem as a
Partially Observable Markov Decision Process (POMDP). Extending
Definition 1, an unconstrained POMDP can be defined as a
6-tuple, $(\mathcal{S}, \mathcal{A}, \mathcal{O}, u, \Pr(s'|s,a), \Pr(o|s,a))$,
in which $\mathcal{O}$ is the set of observations $o$, and
$\Pr(o|s,a)$ denotes the mapping probability between the system
states and the observations. Instead of directly observing the
state $s$, the learning agent can only obtain the network
observation $o$. In a POMDP, the random process associated with
the observation is no longer a Markov process. A standard
model-based solution to the POMDP is to convert the recorded
state observations into belief states, and obtain a new
unconstrained MDP with a continuous state space of the belief
states. However, when the state-transition and the
state-observation mappings are unknown, the TD-based learning
schemes cannot be directly used for learning the optimal
strategies of the POMDPs. Instead, other learning algorithms
such as actor-critic learning [110] and policy-gradient-based
learning [111] are applied. In Em), a delay-constrained
least-cost routing problem in MANETs is modeled as a POMDP, the
belief state of which captures the link-delay uncertainty due to
the imprecise link state information. The belief-policy mapping
is considered as a parametric function, the policy parameter
of which is learned through a standard actor-critic learning
method. In [107], to solve the DSA problem in a CRN, the
channel access process of the Secondary Users (SUs) is first
modeled as a constrained POMDP. In the constrained POMDP,
a reward function is used to collect the instantaneous reward of
the SUs, while a cost function reflects the instantaneous cost
of the Primary Users (PUs) due to the channel interference
from the SUs. The partial observation in the problem comes
from the imperfect spectrum sensing of the SUs over the primary
channel state. After converting the original constrained
POMDP into an unconstrained POMDP with the help of the
Lagrange multiplier, the learning algorithm based on policy
gradient imi is applied for finding a locally optimal policy.
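
The model-based conversion of observations into belief states mentioned above can be sketched as a Bayesian filter. This is a minimal illustration only: it assumes known transition and observation probabilities, and it assumes that the observation probability is indexed by the state reached after the transition. The dictionary layout and the two-state channel example are hypothetical.

```python
def belief_update(belief, action, obs, trans, obs_prob):
    """Bayes update b'(s') proportional to Pr(o|s',a) * sum_s Pr(s'|s,a) * b(s).

    belief:   {state: probability}
    trans:    trans[s][a][s_next] = Pr(s_next | s, a)     (assumed known)
    obs_prob: obs_prob[s_next][a][o] = Pr(o | s_next, a)  (assumed known)
    """
    states = list(belief)
    unnormalized = {}
    for s_next in states:
        predicted = sum(trans[s][action][s_next] * belief[s] for s in states)
        unnormalized[s_next] = obs_prob[s_next][action][obs] * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in unnormalized.items()}


# Toy primary channel with imperfect sensing (numbers invented for illustration).
trans = {
    "idle": {"sense": {"idle": 0.8, "busy": 0.2}},
    "busy": {"sense": {"idle": 0.3, "busy": 0.7}},
}
obs_prob = {
    "idle": {"sense": {"idle": 0.9, "busy": 0.1}},
    "busy": {"sense": {"idle": 0.1, "busy": 0.9}},
}
posterior = belief_update({"idle": 0.5, "busy": 0.5}, "sense", "idle", trans, obs_prob)
```

When the two probability tables are unknown, this update is not available, which is why the model-free schemes discussed above learn a policy directly from the observations.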

To summarize this section, we categorize in Table VIII
the aforementioned works (and some more) on SAS learning
according to the networking applications that they focus on. As
shown in Table VIII, the SAS-based learning algorithms are
powerful in addressing a number of radio resource allocation
problems, as long as the problems can be formulated as
single-link-centric ones. However, it is worth noting that
although the theoretical support for the convergence of the
SAS-learning schemes has been well studied, convergence still
needs to be addressed under practical circumstances.

IV. Applications of Learning Based on Loosely 
Coupled Multi-Agent Systems 

The multi-agent learning scheme naturally leads to the
framework of distributed decision making, and thus to the
possibility of self-organization without a dedicated central
coordinator. Therefore, it is considered especially appropriate
for the network management problems in CRNs, device-to-device
(D2D) networks, heterogeneous networks (HETNETs) and ad-hoc
networks, as long as the networks consist of multiple
independent decision-making entities. However, although the
framework of distributed decision making naturally suggests
adopting the multi-agent decision learning scheme for network
control, it is worth noting that in most cases it may be
difficult to directly adopt the learning mechanisms based on
the loosely coupled MAS by simply ignoring the interactions
between the network entities and treating each of them as an
independent learner. Due to the existence of device
interaction, it is necessary to carefully investigate both the
advantages and the limitations of formulating a distributed
network control problem as a loosely coupled MAS. Furthermore,
when adopting the model of learning in the loosely coupled MAS,
it is still necessary to check to what level the information
exchange between the learning agents is needed, and in what
ways it can help improve the performance of the network.

The new notations used in this section are summarized in
Table IX.

A. Applications of Distributed Learning Based on the Model 
of Loosely Coupled Multi-Agent Systems 

For distributed learning in wireless networks, it is usually
difficult to draw a clear line between a non-game-based,
multi-agent decision learning scheme and an SG-based learning
scheme. The reason for this lies in the inherent nature of
TABLE VIII

Applications of SAS-based learning schemes in cognitive wireless networks: a summary

Network Type | Application | Problem Formulation | Reference | Learning Scheme | Learning Scheme Variation | Convergence
Cellular | Dynamic channel allocation | MDP 18ll~ (Hal, (m, [86], semi-MDP [103] | iooljod | Q-learning [81], [82], [103], SARSA (§3, ID | Neural network [82], state abstraction [85], (H, N/A [Toll | N/A
 | Multirate transmission control | MDP | | Q-learning | Neural network with feature extraction [83] | N/A
 | Call admission control | CMDP | tm, fion | Q-learning [94], fioTI | State abstraction [94], fioTI | N/A
 | Joint admission-bandwidth control | CMDP | [99], [104] | Q-learning [104], neural network [99] | N/A | N/A
Single link | Cross-layer resource allocation | Layered-MDP | [89], [93] | Layered Q-learning | Virtual experience tuples m | N/A
 | Scheduling-admission control | CMDP | | Stochastic subgradient | N/A | Deterministic optimal policy 1^
 | V-BLAST power-rate control | CMDP | “[SU | Q-learning | Constrained structured Q-learning | Randomized optimal policy
MANETs | QoS routing | POMDP | Ton | Actor-critic learning | N/A | N/A
CRNs | Dynamic spectrum access | liiii | [95], (m, [106], [107] | Actor-critic learning [106], R-learning m. Wt\, policy gradient [107] | N/A [95], [106], [107], arbitrary state reduction mi | Deterministic optimal policy (m, N/A 1^, [106], local optimum policy [107]
HETNETs | Vertical handoff | CMDP | “in | Q-learning | N/A | Optimal randomized policy
 | Admission control | MDP | “nsa | Q-learning | Q-learning based on neural-fuzzy-inference network | N/A

TABLE IX

Summary of the new notations in Section IV

Symbol | Meaning
$\gamma_r^{i,F}$ | The received SINR for femto/pico link i over resource block r
$p_r^{i,F}$ | The transmit power of femto/pico BS
$g_{ii,r}^{FF}$ | The link gain between the femto/pico BS and its user
$g_{ji,r}^{MF}$ | The link gain between the macro BS and the femto/pico user
$\sigma^2$ | Noise power
$I[x]$ or $I(x,y)$ | The indicator function
$w_i(j)$ or $w_i'(j)$ | The weight assigned by agent i for its neighbor j's instantaneous reward or estimated state value
$Y$ | The social reward of a group of agents
$y$ | The private reward that an individual agent chooses

strategy coupling in most of the practical networking problem
setups. One typical example is illustrated in Una, ma,
which consider that L macrocells and N femtocells/picocells
operate over the same frequency band (see Figure 9) in a
HETNET. In order to develop a self-organized power allocation
scheme for the downlink transmission in the HETNET,
the Shannon capacity of a link is considered as the individual
utility of a cell, which is a function of the
Signal-to-Interference-plus-Noise-Ratio (SINR) of the
transmitting link in that cell. Taking the femtocells/picocells
as an example, when both the inter-cell interference and the
cross-tier interference are considered, the SINR at the
receiver of femtocell/picocell link i is determined as follows:


\[
\gamma_r^{i,F} = \frac{p_r^{i,F}\, g_{ii,r}^{FF}}{\sum_{j=1}^{L} p_r^{j,M} g_{ji,r}^{MF} + \sum_{k=1, k \ne i}^{N} p_r^{k,F} g_{ki,r}^{FF} + \sigma^2},
\tag{16}
\]

where $p_r^{i,F}$ is the transmit power of femto/pico Base Station
(BS) i over the resource block r, $g_{ii,r}^{FF}$ is the link gain
between the femto/pico BS and its user, $g_{ji,r}^{MF}$ is the link
gain between macro BS j and the femto/pico user, $g_{ki,r}^{FF}$ is
the link gain from another femto/pico BS k to the user of
femto/pico BS i, and $\sigma^2$ is the noise power.

Fig. 9. Structure of a HETNET with both inter-cell and cross-layer
interference. A HETNET is featured by the hierarchy in the network
structure, which comprises the high-power, high-capacity, wide-range
macrocells and the low-power, low-capacity, small-range
femtocells/picocells fml .
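
As a numerical illustration of the SINR expression in (16) and of the Shannon-capacity utility built on it, the following minimal sketch evaluates one femto/pico link on one resource block. The function names, the pair-based representation of the interferers and all numbers are hypothetical.

```python
import math


def femto_sinr(p_own, g_own, macro_links, femto_links, noise):
    """SINR of one femto/pico link on one resource block.

    macro_links / femto_links: lists of (transmit_power, link_gain) pairs
    describing the cross-tier and the inter-cell interferers, respectively.
    """
    cross_tier = sum(p * g for p, g in macro_links)
    inter_cell = sum(p * g for p, g in femto_links)
    return (p_own * g_own) / (cross_tier + inter_cell + noise)


def shannon_capacity(sinr, bandwidth_hz=1.0):
    """Shannon capacity in bit/s for the given bandwidth."""
    return bandwidth_hz * math.log2(1.0 + sinr)


# One macro interferer and one neighboring femto BS (invented numbers).
sinr = femto_sinr(1.0, 1.0, macro_links=[(2.0, 0.25)], femto_links=[(1.0, 0.25)], noise=0.25)
utility = shannon_capacity(sinr)
```

Because the denominator contains the powers of all other BSs, the utility of one link changes whenever any other BS changes its power, which is the strategy coupling discussed in the text.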

Evidently, the capacity of femto/pico link i is determined
not only by the transmit power of femto/pico BS i, $p_r^{i,F}$, but
also by the inter-cell interference
$\sum_{k \ne i} p_r^{k,F} g_{ki,r}^{FF}$ and the cross-tier
interference $\sum_{j=1}^{L} p_r^{j,M} g_{ji,r}^{MF}$. Therefore, the
private utility of femto/pico link i is also a function of the
strategies of the other femto/pico BSs k (k = 1, ..., N, k ≠ i)
and all the macro BSs j (j = 1, ..., L). The goals of the local
cells in maximizing their individual utilities conflict with
each other, and it is difficult to decompose the strategy
coupling between the cells. As a result, many works formulate
the same problem as a noncooperative repeated game [115], [116].
However, it is still possible to tackle such a power control
problem by treating the strategies of the other BSs as part of
the environment dynamics. For example, in [113] the system state
from the perspective of a femto/pico link is designed as a
binary one:

\[
s_t^{i} = \begin{cases} 1, & \text{if } \gamma_t^{M} < \gamma_{Th}, \\ 0, & \text{otherwise}, \end{cases}
\tag{17}
\]

which is based on hard thresholding of the SINR of the
macrocell user, $\gamma_t^{M}$, under the interference from the
femto/pico links (compared with the permitted SINR given as
$\gamma_{Th}$). A similar network-state formulation can be found
in other works such as na. By adopting a standard Q-learning
scheme based on the assumption of independent state-value
evolution, it is assumed in mni, dm that the dynamics of the
aggregated interference to the macrocell user is a stationary
Markov process. Consequently, all the strategies of the other
femto/pico users are treated as stationary ones and hence as
part of the wireless environment. In most cases, such a
formulation/solution with distributed MDPs and the independent
Q-learning algorithm may not guarantee convergence to any
equilibrium. However, empirical studies show that when using
the distributed Q-learning scheme, convergence can still be
achieved given a sufficiently large number of iterations cni,
mni, ma, and the distributed Q-learning algorithm is also able
to achieve a better performance compared with the non-adaptive
algorithms El, ED, GM. Although not mathematically proved, one
possible explanation for such a result may lie in Proposition 2,
since an independent Q-learning agent is always able to
converge as long as the other agents happen to converge in
behavior.
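
The independent-learner formulation described above can be sketched as follows. This is a minimal illustration, assuming a binary state obtained by thresholding the macrocell user's SINR (with value 1 when the SINR falls below the threshold), a discrete set of candidate power levels as actions, and an epsilon-greedy exploration rule. The class name, the exploration rule and all parameters are hypothetical and are not taken from the cited works.

```python
import random


def binary_state(macro_sinr, sinr_threshold):
    """Hard-threshold state: 1 if the macro user's SINR is below the threshold."""
    return 1 if macro_sinr < sinr_threshold else 0


class IndependentQLearner:
    """Tabular Q-learner that treats all other BSs as part of the environment."""

    def __init__(self, actions, alpha=0.1, beta=0.9, epsilon=0.1, seed=0):
        self.actions = list(actions)  # candidate transmit power levels
        self.alpha, self.beta, self.epsilon = alpha, beta, epsilon
        self.rng = random.Random(seed)
        self.q = {s: {a: 0.0 for a in self.actions} for s in (0, 1)}

    def act(self, state):
        """Epsilon-greedy action selection over the local Q-table."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[state][a])

    def learn(self, state, action, reward, next_state):
        """Standard Q-learning update using only local information."""
        target = reward + self.beta * max(self.q[next_state].values())
        self.q[state][action] += self.alpha * (target - self.q[state][action])


agent = IndependentQLearner(actions=[0.1, 0.5])
s = binary_state(macro_sinr=2.0, sinr_threshold=3.0)
a = agent.act(s)
agent.learn(s, a, reward=1.0, next_state=binary_state(4.0, 3.0))
```

Each BS would run its own copy of such a learner. Nothing in the update accounts for the fact that the other learners are adapting at the same time, which is why convergence is not guaranteed in general.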

Generally, for the network management problems with
strategy coupling, directly adopting the distributed learning
schemes in the loosely coupled MAS (e.g., multi-agent learning
in the form of distributed, independent Q-learning) can
be considered as an approach that trades off the certainty of
algorithm convergence for the simplicity of system analysis
and learning-rule design. Apart from heterogeneous networks,
applications that follow such a design pattern can be found
in problem formulations such as distributed DSA with
SU collisions Ell-Ell, power allocation in the overlay
cognitive wireless mesh network Ea and dynamic spectrum
management in 4G cellular networks [123]. Although many
studies adopt such a design pattern for the learning
schemes, it is important to reiterate that overlooking strategy
coupling may result in poor performance of each learning
agent. In [124], a problem of DSA management with 2 SUs
over 2 primary channels (see Figure 10) is used to exemplify
how the lack of coordination between individual agents may
impact the agent performance. In [124], the availability of a
primary channel is modeled as a two-state discrete Markov
chain. The SUs try to access the idle primary channel while
avoiding collision with the other SU. The adaptation of the
channel-access strategies is formulated as a POMDP, in which
the observation of an SU includes 3 states: busy, collision and
success. Based on the assumption that the presence of the
other SU can be ignored, a model-based single-user approach
for strategy updating is proposed. When compared with the
cooperative approach, which allows the SUs to exchange
their belief state vectors of the POMDP, the performance of
the single-user-based approach is shown to be significantly
inferior. Moreover, the simulation results in [124] show that
the performance of the single-agent-based approach is even
worse than that of the deterministic channel-assignment scheme,
which indicates that in the situation of strategy coupling,
allowing some degree of cooperation is essential.

Fig. 10. Illustration of the interference map for the two-SU-two-PU DSA
network ESS.

In order to balance the simplicity of the learning
mechanism (namely, the distributiveness of strategy learning)
against the optimality of the learning algorithm, careful
modeling is needed with respect to different network scenarios.
In [125], a set of decision-learning mechanisms based on
distributed Q-learning is adopted for a scalable DSA mechanism
in an overlay CRN. The goal of the learning mechanism design
is to obtain near-optimal strategies without explicit
coordination among the SUs. It is shown in [125] that by
properly designing the private/local objective functions of the
individual SUs, the needs of both agent coordination and
distributed decision-learning can be fulfilled. In [125], the
SUs are assumed to share the temporarily free band roughly
equally. This means that the reward of an individual SU with
DSA, $u_i(t)$, is approximately equal to the average of the
social reward of all the SUs that attempt to use the same
primary band (denoted by $Y(t)$):


\[
u_i(t) = \frac{Y(t)}{|\mathcal{N}_i(t)| + 1},
\tag{18}
\]


where $\mathcal{N}_i(t)$ is the set of SUs that interfere with SU i
over the same band at time t. The PU activity is also modeled as
a two-state Markov chain. In [125], two guidelines are proposed
for designing the private/individual objective function of each
SU:

1) alignedness, which reflects agent coordination; full
alignedness requires that the SUs do not work against
each other when maximizing their own private objectives;

2) sensitivity, which reflects the efficiency of the individual
learning processes and requires the SUs to be able to
discern the impact of their own action changes so as to
learn better local strategies fast enough.

In [125], the measurable indices of “factoredness” and
“learnability” are introduced to measure the alignedness and
sensitivity of the private objective function, respectively.
Denoting the selected private objective function as $y_i$ ($y_i$
may not be the same as $u_i$) and the joint deterministic
strategy of the SUs over the same band as
$\pi = (\pi_i, \pi_{-i})$, the degree of factoredness and
learnability can be expressed as in (19) and (20), respectively:


E.. E._. 1 


(19) 

( 20 ) 


where $I[x]$ is the indicator function, $I[x] = 1$ if $x > 0$ and
$I[x] = 0$ otherwise, and $F_{y_i}$ ($0 \le F_{y_i} \le 1$) measures
the consistency between the local objective and the social
payoff. The higher the degree of factoredness (i.e., the value
of $F_{y_i}$), the more likely it is that a change of the local
action by SU i will have the same impact on both its private
reward and the global reward. $L_{i,y_i}$ measures the sensitivity
of the local reward to the local action changes. According to
(20), the higher the sensitivity (i.e., the learnability), the
stronger the dependence of $y_i(\pi)$ on the local actions of SU
i. By employing the property described by (18), namely, the
private reward $y_i(t) = u_i(t)$ being proportional to the global
reward $Y(t)$, it is shown in [125] that a good objective
function can be obtained by removing from $Y(t)$ the effects of
all SUs other than SU i. A general form of such a local
objective function can be expressed as follows:


\[
D_i(\pi) = Y(\pi) - Y(\pi_{-i}).
\tag{21}
\]

Since $u_i(t)$ is a function of both $Y(t)$ and the cardinality of
the interfering-SU set $\mathcal{N}_i(t)$, all that SU i needs in
order to obtain the value of $D_i$ is an estimate of
$|\mathcal{N}_i(t)|$ given the information that SU i observes
locally. It is shown that with the proposed objective
function (21), the distributed learning scheme achieves better
spectrum efficiency than learning with either the private
reward or the global reward. From the game theoretic
perspective, spectrum access with the individual reward as in
(18) can be interpreted as a cardinal potential game ifTSl, in
which (21) is in the exact form of a potential function. In
this sense, the design of the objective function in [125] can
be considered as a special case of global-reward-based
learning, and may not be easily extended to a general radio
resource management problem such as CH, El. Although the two
indices in (19) and (20) provide important guidelines on
individual utility function design for distributed learning,
appropriate approaches other than that given by (18) are still
needed for the networking applications which cannot be modeled
as a potential game.
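
The local objective in (21) can be illustrated with a minimal sketch. It assumes a toy social reward in which every band occupied by at least one SU contributes one unit of throughput, so that the difference between the social reward with and without SU i is the marginal contribution of SU i. The toy reward and all names are hypothetical and are not the model of the cited work.

```python
def social_reward(joint_strategy):
    """Toy Y(pi): one unit of throughput per band used by at least one SU.

    joint_strategy maps each SU to its chosen band, or to None if it stays idle.
    """
    return float(len({band for band in joint_strategy.values() if band is not None}))


def difference_reward(joint_strategy, agent):
    """D_i(pi) = Y(pi) - Y(pi_{-i}): the marginal contribution of `agent`."""
    without_agent = dict(joint_strategy)
    without_agent[agent] = None  # remove the effect of the agent's own action
    return social_reward(joint_strategy) - social_reward(without_agent)


pi = {"su1": "A", "su2": "A", "su3": "B"}
d_su1 = difference_reward(pi, "su1")  # su1 shares band A with su2
d_su3 = difference_reward(pi, "su3")  # su3 is alone on band B
```

In this toy setting an SU that merely duplicates another SU's band contributes nothing, while an SU that opens an otherwise unused band receives the full credit, so improving the private objective cannot reduce the social reward.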

Instead of designing a different objective function, the
learning scheme itself can also be tailored to meet the
requirements of radio resource management. One example of
learning scheme design in the strategy-coupling scenario is
provided by El, which studies an Aloha-like spectrum access
scheme without any negotiation in a multi-user, multi-channel
CRN (Figure 11). In El, N primary channels are modeled as
N independent, two-state Markov chains, while the SUs are
assumed to have no mutual communication and need to learn
the collision-avoidance strategies online.

Fig. 11. Channel access competition and conflict in an Aloha-like multi-user-
multi-channel CR system fT^ .

Instead of adopting the standard state-value evolution model
given in (HI and the TD-based strategy-learning mechanism given
in (7), the expected one-time reward is adopted as (22):

\[
Q_{ij} = E[u_i \mid a_i(t) = j, s(t) = s],
\tag{22}
\]

and a learning mechanism without considering the future
reward is designed as (23):

\[
Q_{ij}(t+1) = (1 - \alpha_{ij}(t))\, Q_{ij}(t) + \alpha_{ij}(t)\, u_i(t)\, I(a_i(t), j).
\tag{23}
\]

In (22) and (23), $a_i(t) = j$ represents the action of SU i to
select channel j for transmission, $s$ is the vector of the
channel states, $\alpha_{ij}(t)$ is the learning step, $u_i(t)$ is
the instantaneous reward of SU i and $I(x,y)$ is the indicator
function (i.e., $I(x,y) = 0$ if $x \ne y$ and $I(x,y) = 1$ if
$x = y$). Although (23) appears in a similar form to distributed
Q-learning, it is derived based on the analysis of the channel
contention as an SG. It is shown in [126] that with the
Boltzmann distribution-based strategy exploration, the learning
scheme in (23) is equivalent to the Robbins-Monro iteration
[127] and converges asymptotically to a stationary point (i.e.,
an NE) with probability one.
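
A learning rule of this kind can be sketched as follows. The sketch assumes one reading of the description of (23), in which only the estimate of the channel that was actually selected is moved toward the instantaneous reward and no future-reward term is used, combined with Boltzmann (softmax) exploration. The function names, the temperature parameter and the numbers are hypothetical.

```python
import math


def myopic_q_update(q_row, chosen, reward, step):
    """Move the estimate of the chosen channel toward the instantaneous reward.

    q_row[j] is the current estimate for channel j. No future-reward term is
    used, in contrast to the standard Q-learning update.
    """
    updated = list(q_row)
    updated[chosen] = (1 - step) * q_row[chosen] + step * reward
    return updated


def boltzmann_probs(q_row, temperature):
    """Boltzmann (softmax) channel-selection probabilities."""
    peak = max(q_row)  # subtract the maximum for numerical stability
    weights = [math.exp((q - peak) / temperature) for q in q_row]
    total = sum(weights)
    return [w / total for w in weights]


q_row = [0.0, 0.0]
q_row = myopic_q_update(q_row, chosen=1, reward=1.0, step=0.5)
probs = boltzmann_probs(q_row, temperature=1.0)
```

With a decreasing step size, an update of this form is a stochastic-approximation iteration, which is consistent with the Robbins-Monro interpretation mentioned above.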

Generally, the aforementioned multi-agent learning schemes
can be divided into two categories, namely, distributed learning
based on the assumption of purely independent state-value
evolution (e.g., El, El, E3-E1) and distributed
learning based on the structural property of the specific
resource management problems (e.g., Ifni . El l. Although
neither of them requires explicit information exchange
among network devices, introducing a certain level of
information exchange (at the cost of more overhead) can
sometimes help improve the network performance. In the
literature, the learning schemes with explicit information
exchange are usually referred to as learning based on the
Distributed Value Function (DVF). With DVF, local devices are
required to share their state-value/reward functions with the
neighbors. Instead of learning the Q-value based on the
individual reward or local state values, individual decision
making aims at maximizing the weighted sum of both the local
and the neighbors' rewards/state-values. By modifying (7), a
typical learning mechanism with DVF can be expressed as

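\[
Q_{t+1}^{i}(s_i, a_i) \leftarrow (1-\alpha_t)\, Q_t^{i}(s_i, a_i) + \alpha_t \left( u_t^{i}(s_i, a_i) + \beta \sum_{j \in \mathcal{N}(i)} w_i(j)\, V_t^{j}(s_j') \right),
\tag{24}
\]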
TABLE X

Applications of independent-learner learning schemes in cognitive wireless networks: a summary

Network Type | Application | Reference | Strategy Coupling Assumptions | Learning Scheme | Convergence
CRNs | Aggregated interference control | “nm | None | Independent Q-learning for MDP and neural network for POMDP fUTl | N/A
 | Joint spectrum and power management | EMT, [ml | None [119], coupling as a noncooperative game [122] | Independent Q-learning [119], □all | N/A
 | Dynamic spectrum access | ill 1, [120], | None [12Uj, [123], coupling with fully connected topology [125], coupling as a noncooperative game [126] | Independent Q-learning IT^Oj, Win-or-Learn-Fast (WoLF) [123], independent learning (unspecified) [125], modified independent Q-learning without considering the future states [126] | N/A [TlO'l, [123], near optimal strategy [125] or NE ma
 | Joint sensing-time and power allocation | [121] | None | Independent Q-learning | N/A
HETNETs | Femto-user power allocation | ~~rn^ | None | Independent Q-learning | N/A
 | Inter-cell interference coordination | TUI | None | Independent Q-learning | N/A
Sensor networks | Coverage and energy consumption management | [128] | Coordinated decision-making | DVF-based learning | N/A
Cooperative networks | Power and relaying probability management | [129] | Coordinated decision-making | Q-learning based on distributed reward and value function | Local optimal point
Cellular networks | Power allocation and experience sharing | [130] | Coordinated decision-making | DVF-based learning | N/A

in which $\mathcal{N}(i)$ is the set of device i's neighbors
(including i) and $w_i(j)$ is the weight that determines the
contribution of device j's state-value to device i's estimation
of $V_i$.
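
A DVF-style update can be sketched as follows. This is a minimal illustration, assuming a tabular Q-function per device, neighbor state values that have already been received over some signaling channel, and weights chosen by the designer. The function name, the argument layout and the numbers are hypothetical.

```python
def dvf_update(q_i, s, a, reward, neighbor_values, weights, alpha=0.1, beta=0.9):
    """One Q-value update of device i with a distributed value function.

    neighbor_values[j]: state value V_j shared by neighbor j (the neighbor set
                        is assumed to include device i itself).
    weights[j]:         weight w_i(j) that device i assigns to neighbor j.
    """
    weighted_value = sum(weights[j] * v for j, v in neighbor_values.items())
    q_i[s][a] = (1 - alpha) * q_i[s][a] + alpha * (reward + beta * weighted_value)
    return q_i[s][a]


# Device "i" with one neighbor "j" and equal weights (invented numbers).
q_i = {0: {0: 0.0}}
dvf_update(q_i, s=0, a=0, reward=1.0,
           neighbor_values={"i": 2.0, "j": 4.0},
           weights={"i": 0.5, "j": 0.5})
```

Setting the weight of every neighbor other than the device itself to zero gives back a purely local update, which makes the signaling overhead of DVF explicit: one shared state value per neighbor and per update.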

The applications of the DVF-based learning mechanism
in wireless networks can be found in [128]-[130]. In [128],
DVF-based learning is used in an ad-hoc sensor network to
coordinate the sensing and hibernation operations as the state
of the grid-point coverage changes. To encourage the sensor
node with a larger coverage area to perform the sensing
operation, the individual reward is designed as a function of
the number of the covered grid points. It is shown that
DVF-based learning outperforms the independent-learner-based
learning algorithm, especially under the condition of high
sensor node densities. In [129], a learning algorithm based on
the exchange of both the instantaneous reward and the estimated
local state-value is proposed for the joint power control and
relay selection in a distributed cooperative network. The
proposed learning scheme is characterized by weighting over
both the instantaneous reward and the estimated local
state-value that are shared by the neighbor nodes, and is thus
called learning with the Distributed Reward and Value (DRV)
function. By extending (24), the rule of learning with DRV can
be expressed as follows:

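\[
Q_{t+1}^{i}(s_i, a_i) \leftarrow (1-\alpha_t)\, Q_t^{i}(s_i, a_i) + \alpha_t \left( \sum_{j \in \mathcal{N}(i)} w_i'(j)\, u_t^{j}(s_j, a_j) + \beta \sum_{j \in \mathcal{N}(i)} w_i(j)\, V_t^{j}(s_j') \right),
\tag{25}
\]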

in which $w_i'(j)$ and $w_i(j)$ are the weights that node i gives
to its neighbor j's instantaneous reward and estimated state
value, respectively. With the learning scheme given in (25),
each node in the network maintains a vector of both the
channel/buffer state of its direct link and the channel/buffer
state of its cooperative link. It is shown in [129] that learning
based on sharing both the instantaneous rewards and the local
state values can achieve a better power efficiency than that
using only the local reward or the local state value information.
In [130], the DVF-based learning scheme is adopted in a real-time
multimedia cellular network to adapt the power allocation
of interfering links. In addition to coordinating the individual
links, the Q-value updating mechanism (24) is also used to
improve the convergence of the newly adopted links in the
network.

In Table X, we categorize the works discussed in this
subsection according to their respective applications. For
applications of multi-agent independent-learning schemes in
wireless networks, convergence of learning remains an open
issue in most of the existing studies. Compared with the
SAS-based learning algorithms, adopting independent learning
schemes requires more attention for any specific networking
optimization problem.

B. Experience Sharing Based on Distributed Learning 

Apart from improving the expected network performance
with shared information in the form of structured
reward/state-value functions (e.g., using the social reward and
the DVF/DRV functions), another consideration in MAS-based
learning is whether information sharing can also help the
individual learning agents to speed up their learning processes.
To answer this question, it is necessary to investigate
the homogeneity of the distributed learning processes so that
we can check whether one learning process may be able
to benefit from the “shared experience” offered by another
learning process, and furthermore, what form such a “shared
experience” would take.

We call a group of distributed learning processes homogeneous
when the distributed learning agents apply the same
learning method with an evolution determined by exactly the
same stochastic process. In the framework of homogeneous
learning processes, it is possible for individual agents to share
their private experience (e.g., strategies, estimated Q-values)
with the other agents in order to accelerate the learning process
and improve the performance. Recently, the possibility of
applying the teacher-pupil paradigm in human cognition to
solve the wireless networking problems has been discussed in
a series of studies [131]-[134]. In these pioneering studies,
the paradigm of “docitive network” was proposed based on
the extension of distributed cognitive networks (Figure 12).

Fig. 12. Docitive cycle which extends the cognitive cycle by cooperative
teaching (adapted from (mi).

In the framework of docitive networks, “docition” (teaching) is
performed by a more experienced network agent to accelerate
the learning process of the other agents. Depending on the
degree of docition among the wireless devices, the
teaching-learning process can be divided into 3 categories [131]:

• Startup docition: each wireless node learns independently. 
When a new node joins the network, instead of learning 
from zero experience, it learns the policies from docitive 
nodes which have already acquired a certain level of 
expertise on strategy selection. 

• Adaptive docition: the nodes exchange information about 
the performance of their learning processes. The docitive 
nodes share policies and the learning nodes learn from 
the expert neighbors which have the best performance. 

• Perfect docition: each node in the network is able to observe 
the joint action and all individual rewards. Based on 
the observation, every docitive node models its interaction 
with the rest of the network as a complete centralized 
MDP separately and selects its individual actions. 

The basic prerequisite for implementing docition in any networking 
problem is that the individual learning processes can 
be modeled as parallel, homogeneous MDPs, so that 
imitation of the docitive nodes' strategies by the learning 
nodes does not influence the policies of the docitive nodes. 
However, empirical studies have shown that relaxing such a 
constraint in a noncooperative, game-like scenario 
may also help improve the performance of the learning 
nodes [132], [134], [135]. In [132], the distributed downlink 
power allocation problem in an IEEE 802.22 WRAN (underlay 
to the TV-broadcasting bandwidth) is studied. An aggregated 
interference model from the SUs to the PU is considered. The 
channel state experienced by the individual SUs is defined as 
a binary state obtained by hard thresholding of the aggregated 
interference, which is similar to (?). Each secondary BS 
ignores the impact of the other BSs on the channel state and 
adopts a standard independent Q-learning scheme to learn its 
own power selection strategy. The docition process is based 
on exchanging the Q-tables among the neighbor secondary 
BSs. In this case, the learning nodes perform either the startup 
docition or the adaptive docition periodically by adopting 
the Q-tables of the expert nodes with the best performance. 
The simulations in [132] show that the docitive paradigm 
significantly speeds up the learning process with respect to the 
case of independent learners. A similar approach is adopted 
in [134], [135], which study the power allocation problem 
in self-organized heterogeneous networks with femtocells. 
In these studies, a cross-tier interference model is adopted 
in a manner similar to (?), while strategy coupling among 
the femto links is also ignored by individual learners. Again, 
docition is performed through exchanging the Q-tables 
among the neighbor nodes. In [134], a similarity metric is 
introduced, in the form of a user-defined gradient, to measure 
the correlation between the femto BS strategy and 
the aggregated interference to the macrocell. 
The proposed metric measures the 
similarity of the policies between two neighbor nodes. With 
the similarity metric, the learning nodes can not only adopt the 
Q-tables from the neighbor nodes with the best performance, 
but also take into account the degree of similarity between 
their own action-state correlation and their neighbors’. 
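
The Q-table exchange described above can be sketched as follows. This is a minimal illustration only and is not the procedure of [132] or [134]: the function name `docitive_update`, the scalar performance score used to rank the neighbors, and the rule of adopting a neighbor's table only when that neighbor outperforms the learner are all assumptions made for the example.

```python
import numpy as np


def docitive_update(own_q, own_perf, neighbor_qs, neighbor_perfs):
    """Return the Q-table a learner keeps after one docition round.

    own_q          -- learner's Q-table, shape (n_states, n_actions)
    own_perf       -- learner's performance score (hypothetical scalar metric)
    neighbor_qs    -- list of Q-tables advertised by the neighboring nodes
    neighbor_perfs -- list of the neighbors' performance scores
    """
    if not neighbor_qs:
        return own_q
    best = int(np.argmax(neighbor_perfs))
    if neighbor_perfs[best] > own_perf:
        # Imitate the expert; copy so that later local updates stay private.
        return np.array(neighbor_qs[best], dtype=float, copy=True)
    return own_q


# Startup docition: a newcomer with an all-zero table and no track record.
newcomer_q = np.zeros((2, 3))
experts = [np.full((2, 3), 0.4), np.full((2, 3), 0.9)]
adopted = docitive_update(newcomer_q, float("-inf"), experts, [0.6, 0.8])
```

Calling the same function periodically with the learner's current score corresponds loosely to adaptive docition; a similarity-weighted variant in the spirit of [134] would replace the plain `argmax` by a ranking that also accounts for policy similarity.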

While it is relatively easy to implement docition in the 
framework of independent Q-learning based on the model of 
parallel, homogeneous MDPs, it generally remains an open 
issue to estimate the similarity of the policies between two 
neighbor learners when the learning processes are heterogeneous. 
In particular, in the scenario of strategy coupling and 
conflicting interests, imitating the strategies or the Q-tables of 
the adversary neighbor node with the best performance may 
result in strategy oscillation. Such a situation can be illustrated 
by revisiting the power allocation problem defined by (?). 
In the simplified situation of mutual interference with only 
two femto BSs, increasing the transmit power of one BS 
degrades the performance of the other BS, 
because the interference to the other BS is also increased. 
Consider the case in which the BS with the smaller transmit power 
decides to adopt the strategy of its rival BS by increasing 
its transmit power. If independent Q-learning is used by both 
BSs to learn their power selection strategies, the other BS will 
soon discover that it will benefit from increasing its current 
transmit power too. This creates an “arms race” situation in 
which each BS increases its transmit power in turn 
until both BSs reach their maximum power level, which is 
a typical instance of the prisoner’s dilemma in noncooperative 
games. Such an unwanted situation can be avoided if both BSs 
treat the power allocation process as a noncooperative game 
and adopt game-based learning methods such as Fictitious 
Play (FP) and best response without any docition procedure.† 
As a result, in works such as [137] the docitive paradigm 
and the game-based learning paradigm are considered two 
conflicting frameworks for strategy learning. However, it is 
worth noting that with emerging techniques such as transfer 
learning [138] and experience-weighted attraction learning 
[?], incorporating the teaching process in the game-based 
framework of learning is no longer impossible. We leave 
the more detailed discussion of this topic to a later section. 

† Studies adopting the same mutual interference model as in (?) within 
the framework of repeated games can be found in [?]. In [?], the best 
response without docition ensures the convergence to the Pareto-dominant 
equilibria. 

TABLE XI 

Summary of the main notations in Section V 

Symbol                            Meaning 

$\kappa_i^t(a_{-i})$              The statistic of player i for its opponents' actions 
$\theta_i^t(a_{-i})$              The estimated probability of player i for its opponents to play action $a_{-i}$ 
$\mathrm{BR}_i(\cdot)$            The set of best-response actions 
$\upsilon$                        The learning parameter for perturbation in SFP 
$\alpha_t$                        The learning factor in SFP 
$q_i^t(a_i)$                      The estimated frequency of local actions 
$\epsilon_t$                      The time-varying step size in GP 
$\mu_t$                           The learning parameters in GP, LA and no-external-regret learning 
$p_i$                             The probability of accessing a channel in a random medium access game 
$R_m$                             The transmit rate over channel m 
$R_i(\pi_i, a_i \mid a_{-i})$     The regret of player i for playing strategy $\pi_i$ instead of playing $a_i$ 
[?]                               The regret for agent i not playing $a_i'$ at time t 
$c_i^t(s, a_{-i})$                The conjecture of opponent policy $\pi_{-i}(s)$ by agent i at time t 
[?]                               The reference point for conjecture learning 

V. Applications of Game-Based Learning in 
Cognitive Wireless Networks 

Generally, it is the limitations of the distributed learning 
mechanisms (e.g., the algorithms reviewed in Section IV) that 
necessitate the introduction of game-based learning 
mechanisms in CRNs. By modeling distributed network control 
problems as games, it is possible to better address the 
problems raised by device interactions in the networks. Also, 
it is possible to design learning schemes that theoretically 
guarantee the convergence of the individual strategies to a 
fixed point or equilibrium, while such convergence is usually 
not guaranteed by the distributed learning mechanisms. In this 
section, we consider repeated games as special cases 
of SGs and introduce the applications of learning algorithms 
based on repeated games and SGs separately. We organize 
the learning algorithms based on the three game property 
dimensions discussed in Section III-C. Our major focus will be 
(a) the rules in each learning scheme; (b) the conditions and 
properties of the games with which a specific learning scheme 
may converge; and (c) the degree of information exchange 
required by each learning scheme to achieve convergence. The 
new notations used in this section can be found in Table XI. 

A. Applications of Learning in the Context of Repeated Games 

Repeated games play an important role in problem formulation 
for distributed network control. When the network 
evolution is not subject to a stochastic environment, most 
of the network control problems that require considering 
the interactions among distributed devices can be formulated 
as a repeated game instead of an MAMDP. In contrast to 
the MDP-based learning mechanisms that heavily depend on 
value iteration, policy iteration now plays an important role 
in deriving the learning rules for repeated games. In the 
context of repeated games, model-free learning places more 
emphasis on the situation of information locality. This is because 
in many practical scenarios, the information of the local 
utilities, actions or strategies of one network device may not 
be available to the other devices due to either privacy 
concerns or the lack of sufficient resources for information 
exchange. In this subsection, we organize our survey 
on the applications of learning in repeated games according 
to the prototypical learning schemes that they are based on. 
These prototypical learning schemes include (i) fictitious play, 
(ii) gradient play, (iii) learning automata and (iv) no-regret 
learning. 

1) Fictitious Play and Stochastic Fictitious Play: The basic 
prerequisite of the standard FP is that the agents are willing 
to reveal their (discrete) action information to the others after 
each round of play, so they can track the frequency of action 
selection by the other agents [?]. With FP, agent i assesses 
the distribution of its opponents' actions at round t as follows: 

$$\kappa_i^t(a_{-i}) = \kappa_i^{t-1}(a_{-i}) + I(a_{-i}^{t-1}, a_{-i}), \tag{26}$$

in which the indicator $I(a_{-i}^{t-1}, a_{-i})$ equals 1 if 
$a_{-i}^{t-1} = a_{-i}$ and 0 otherwise. 
Agent i estimates the probability for the opponent agents to 
play the joint action $a_{-i}$ at round t as: 

$$\theta_i^t(a_{-i}) = \frac{\kappa_i^t(a_{-i})}{\sum_{a'_{-i} \in A_{-i}} \kappa_i^t(a'_{-i})}. \tag{27}$$

In this sense, FP is sometimes considered as a model-based 
learning mechanism since with (27) it tries to build the model 
of the opponents’ joint policy from accumulated experience. 
However, compared with other model-based, non-learning 
mechanisms such as dynamic programming for MDPs, FP 
does not need any a-priori knowledge of the system or other 
players. Based on (27), FP is defined as any rule that assigns 
the best response to agent i given its current estimate of 
the opponent policy $\theta_i^t(a_{-i})$. Usually, such an operation is 
represented by $a_i^t \in \mathrm{BR}_i(\theta_i^t(a_{-i}))$, where the operator 
$\mathrm{BR}_i(\cdot)$ derives the best-response action set. Typically, $\mathrm{BR}_i$ can 
be derived by maximizing the estimated expected payoff of 
agent i: $\mathrm{BR}_i(\theta_i^t(a_{-i})) = \arg\max_{a \in A_i} E[u_i(a, \theta_i^t(a_{-i}))]$. The 
convergence property of FP in a general repeated game is given 
by Theorem 2 [?]. 

Theorem 2 (Convergence of FP). 1) Strict NE* are the 
absorbing states for the process of fictitious play. 2) Any pure-strategy 
steady state of fictitious play must be an NE. 

Theorem 2 gives a sufficient condition for FP to settle at 
an NE: once the play reaches a strict NE, it stays there. 
Thereby, FP-based learning is expected to converge in 
repeated games that possess at least one strict 
pure-strategy NE, provided that the play reaches such an NE. 
According to Theorem 2, a typical way of 
checking the convergence condition for FP in a game is to 
check if the game possesses certain properties (such as being 
potential or S-modular [?]) that guarantee the existence of a 
pure-strategy NE. 
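
As an illustration of the counting rule (26), the frequency estimate (27) and the best-response step, the following sketch runs FP in a two-player coordination game. It is a toy example under stated assumptions: the function name `fictitious_play`, the payoff matrices, the simultaneous update order and the tie-breaking rule (lowest action index) are chosen for the example and do not come from the surveyed works.

```python
import numpy as np


def fictitious_play(payoff_0, payoff_1, initial_actions, rounds=50):
    """Two-player fictitious play with deterministic best responses.

    payoff_i[a_i, a_j] is the payoff of player i when it plays a_i and its
    opponent plays a_j. Returns the final action pair and, for each player,
    its estimate of the opponent's mixed policy.
    """
    payoffs = (np.asarray(payoff_0, dtype=float),
               np.asarray(payoff_1, dtype=float))
    # kappa[i][a] counts how often player i has seen its opponent play a, cf. (26).
    kappa = [np.zeros(payoffs[0].shape[1]), np.zeros(payoffs[1].shape[1])]
    actions = list(initial_actions)
    theta = [None, None]
    for _ in range(rounds):
        for i in (0, 1):
            kappa[i][actions[1 - i]] += 1.0
        next_actions = []
        for i in (0, 1):
            theta[i] = kappa[i] / kappa[i].sum()            # cf. (27)
            expected = payoffs[i] @ theta[i]                # expected payoff per action
            next_actions.append(int(np.argmax(expected)))   # ties -> lowest index
        actions = next_actions
    return tuple(actions), theta


# Coordination game with two strict pure-strategy NE: (0, 0) and (1, 1).
coordination = [[2.0, 0.0], [0.0, 1.0]]
final_actions, beliefs = fictitious_play(coordination, coordination, (0, 1))
```

Starting from the miscoordinated profile (0, 1), the play settles at the strict NE (0, 0); starting at (1, 1), it never leaves that profile, which is the absorbing behavior stated in Theorem 2.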

As long as the learning agents are able to observe the actions 
of the rival agents or afford the overhead for action information 
exchange, FP can be employed as the basic solution for many 
resource management games in wireless networks. In [140], 
an FP-based multi-agent learning algorithm is employed by 
the secondary nodes in an ad-hoc DSA network to learn 
the strategies for forwarding delay-sensitive packets. In [140], 
the condition of channel availability is characterized by the 
matrix of spectrum opportunity, and the condition of channel 
contention is characterized by the interference matrix from 
both the PUs and the rival SUs. With the learning scheme 
proposed in [140], each SU needs to collect the information 
about the spectrum opportunity matrix locally, and establish 
its local interference matrix according to the action information 
collected from its neighbors. Then, every SU tracks 
the frequency of action selection by its neighbors according 
to a modified version of (26) with a discount factor applied to $\kappa_i^{t-1}$. 
Each SU also needs to determine a subset of feasible actions 
that do not interfere with higher priority traffic. This is 
done through estimating the expected interference based on 
the policy estimation model in (27). The local deterministic 
best response is calculated based on minimizing the expected 
effective transmission time over the candidate links. 

* This is equivalent to the condition that the best-response payoffs in the 
NE are strictly greater than the other possible payoffs for all the agents. 

Another example of FP can be found in [141], which applies 
FP to obtain a defense mechanism against eavesdropping and 
jamming attacks in the uplink of a cellular network consisting 
of multiple relays (Figure 13). In the defense-attack game, 
the normal/malicious nodes are assumed to be able to observe 
the actions of other nodes, so they can use the models in 
(26) and (27) to estimate the other nodes’ policies. Instead of 
directly obtaining a deterministic strategy based on the local 
best response, each normal node updates its mixed strategy at 
time slot t as follows [141]: 

$$\pi_i^t(m) = \pi_i^{t-1}(m) + \frac{1}{t}\left(I(a_i^t, m) - \pi_i^{t-1}(m)\right), \tag{28}$$

in which m is the index of the candidate relays. The malicious 
node adopts a similar policy-updating rule based on its own 
action set for attacking. The actions of each node at round 
t are selected from the best response based on the expected 
private utility with the locally estimated policy vector $(\pi_i^t, \theta_i^t)$. 
The same learning rule as in [141] can be found in [142], 
which uses the local policy updating rule in (28) to learn the 
strategy in a continuous strategy space for power allocation. 
In [142], such a learning scheme is referred to as the best 
response dynamics of the power allocation game, and is proved 
to be able to converge to the ε-equilibria. Such a learning 
rule is also adopted in [143], which formulates a hierarchical 
network formation game for nodes in a multi-hop wireless 
network to select relays. In [143], the relay selection game is 
decomposed into multiple layers and solved using a backward 
induction method from the sink to the source. The learning 
scheme defined by (26)-(28) is applied to each layer game and 
the mixed strategies are obtained from the local best responses. 

Fig. 13. A network consisting of M one-hop relays and N wireless users that 
is subject to eavesdropping/jamming from one active malicious node [141]. 

With the standard FP, local actions are updated based on 
the best responses, which are generally pure strategies. 
As pointed out by [?], one drawback of such an FP-based 
learning scheme lies in the discontinuity of agent behaviors, 
since a small change in the opponent-policy estimate may 
result in an abrupt change of the local behavior. For this reason, a 
Smoothed-FP (SFP) procedure was proposed, which searches for 
the best response with a modified local objective function 
that is perturbed by a differentiable, strictly concave function. 
Assume that the best response is obtained through maximizing 
a payoff function $u_i(\pi_i, \pi_{-i})$. Then the operation for obtaining 
the smoothed best response $\overline{\mathrm{BR}}(\cdot)$ can be used to replace the 
original best response $\arg\max_{\pi_i} u_i(\pi_i, \pi_{-i})$: 

$$\overline{\mathrm{BR}}(\pi_{-i}) = \arg\max_{\pi_i} \left\{ u_i(\pi_i, \pi_{-i}) + \upsilon\, \eta_i(\pi_i) \right\}, \tag{29}$$

in which the perturbation function $\eta_i$ is typically given as the 
entropy function of $\pi_i$: 

$$\eta_i(\pi_i) = -\sum_{a_i \in A_i} \pi_i(a_i) \log \pi_i(a_i). \tag{30}$$

Problem (29) with (30) can be explicitly solved as: 

$$\overline{\mathrm{BR}}(\pi_{-i})(a_i) = \frac{\exp\left(u_i(a_i, \pi_{-i}) / \upsilon\right)}{\sum_{a'_i \in A_i} \exp\left(u_i(a'_i, \pi_{-i}) / \upsilon\right)}, \tag{31}$$

in which $\upsilon$ is the weight of the perturbation term that controls 
the strategy exploration rate. It has been proved that for any 
average-reward repeated game, we can always find a $\upsilon$ that 
makes the payoff of agent i under $\overline{\mathrm{BR}}(\pi_{-i})$ sufficiently 
close to the real best-response payoff (Proposition 4.5 of [33]). 
The SFP-based learning scheme is also known as the stochastic 
FP. Unlike standard FP, in SFP it is not necessary to observe 
the opponents’ actions or even know the structure of the local 
utility functions. Instead, the expected payoff $u_i^t(a_i, \pi_{-i})$ in 
(31) is estimated based on local information as follows: 

$$\tilde{u}_i^t(a_i) = \frac{I(a_i^t, a_i)}{\kappa_i^t(a_i)} \left(u_i^t(a_i) - \tilde{u}_i^{t-1}(a_i)\right) + \tilde{u}_i^{t-1}(a_i), \tag{32}$$

in which $\kappa_i^t(a_i)$ and $I(a_i^t, a_i)$ follow the same definitions as 
in (26), applied to the local actions, and $\tilde{u}_i^t(a_i)$ is the estimate of the expected utility 
$u_i^t(a_i, \pi_{-i})$. The local mixed policy is usually updated in the 
following form: 

$$\pi_i^t(a_i) = \pi_i^{t-1}(a_i) + \alpha_t \left(\overline{\mathrm{BR}}(\tilde{u}_i^t(a_i)) - \pi_i^{t-1}(a_i)\right), \tag{33}$$

in which $\overline{\mathrm{BR}}(\tilde{u}_i^t(a_i))$ is calculated based on (31) with the 
payoff estimated by (32) and $\alpha_t$ is a learning factor. 
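
The three SFP ingredients, namely the logit form of the smoothed best response in (31), the local payoff estimate in (32) and the policy update in (33), can be sketched as below. The sketch makes several simplifying assumptions that are not taken from the surveyed works: a single learner faces a stationary environment with noiseless payoffs, (32) is read as a per-action running mean, the learning factor is constant, and the names `smoothed_best_response` and `run_sfp` are hypothetical.

```python
import numpy as np


def smoothed_best_response(u_est, upsilon):
    """Logit (entropy-perturbed) best response, cf. (31)."""
    z = (u_est - np.max(u_est)) / upsilon    # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()


def run_sfp(mean_payoffs, steps=500, upsilon=0.1, alpha=0.1, seed=0):
    """Single agent running SFP against a stationary environment (assumed)."""
    rng = np.random.default_rng(seed)
    n = len(mean_payoffs)
    policy = np.full(n, 1.0 / n)     # start from the uniform mixed policy
    u_est = np.zeros(n)              # local payoff estimates
    counts = np.zeros(n)             # how often each local action was played
    for _ in range(steps):
        a = rng.choice(n, p=policy)
        payoff = mean_payoffs[a]     # noiseless payoff, for simplicity
        counts[a] += 1.0
        u_est[a] += (payoff - u_est[a]) / counts[a]      # running mean, cf. (32)
        target = smoothed_best_response(u_est, upsilon)
        policy = policy + alpha * (target - policy)      # cf. (33)
        policy = policy / policy.sum()                   # guard against rounding drift
    return policy, u_est


policy, u_est = run_sfp([0.2, 0.8])
```

Only the learner's own actions and payoffs enter the update, which is the information-locality property noted above. A smaller `upsilon` makes the smoothed response closer to the exact best response at the cost of less exploration.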

It is worth noting that with both value iteration in (32) and 
policy iteration in (33), SFP is usually considered as a typical 
form of CODIPAS-RL methods (see the example in [?]). 
Generally, the convergence conditions of SFP are based on the 
analysis of Lyapunov stability of the corresponding perturbed 
best response dynamic [?]. A summary of these conditions 
for different types of games is given as follows.‡ 

Theorem 3 (Convergence of SFP [?]). Consider SFP defined 
by (29)-(33), starting from an arbitrary strategy in game G. 

(i) If G is a two-player symmetric game with an interior 
Evolutionary Stable Strategy (ESS) or a two-player zero- 
sum game, then SFP converges with probability one to 
an NE. 

(ii) If G is an N-player potential game, then SFP converges 
in a subset of the rest points of the perturbed best 
response dynamic. If all the rest points of the perturbed 
best response dynamic are hyperbolic and two-order 
continuous, SFP converges to an NE with probability one. 

(iii) If G is an N-player supermodular game, then SFP 
converges almost surely to a rest point of the perturbed 
best response dynamic. In particular, if the rest point is 
unique, then SFP converges to the NE with probability 
one. 

With the property of requiring no information exchange, 
SFP is considered an important tool in self-organized learning 
for resource allocation games. In [145], SFP is applied to 
the power control game in wireless ad-hoc networks. According 
to Theorem 3, SFP is guaranteed to converge to a 
stationary point (with a non-zero probability to an NE) for 
a supermodular/potential game. In order to take advantage of 
such a property, a supermodular utility function is designed 
for each node in [145], and the convergence with SFP is 
thus guaranteed. However, since the utility function in [145] 
is monotonically decreasing, the learning scheme will finally 
converge to the unique NE of that game, which corresponds to 
all users transmitting with zero power. This problem of utility 
function design is addressed in [116] by studying the power 
allocation problem in a small-cell network through a non-trivial 
Stackelberg game (?). This game design is intended to 
balance the femtocell power efficiency and interference control 
in the macrocell. The supermodularity property is retained for 
the femto link utility, and the SFP-based scheme given in (31)-(33) 
is applied to the follower game among the femtocells. 
The same learning mechanism is adopted in [?], which 
considers the power allocation in the femtocells as a common-payoff 
game (thus a potential game). With the assumption of 
the common-payoff game, it is proved in [?] that the ε-equilibrium 
is guaranteed to be reached in the potential game 
by the SFP-based learning algorithm. 

2) Gradient Play: Compared with FP, Gradient Play (GP) 
adjusts the strategy of one agent based on the gradient ascent 
dynamics instead of directly jumping to the best response 
based on the empirical frequencies of the opponent agents’ 
action selection. Therefore, GP can be viewed as a “better 
response” algorithm. Mathematically, following the learning 
scheme of the standard GP, each agent in the repeated game 
updates its strategy on selecting action $a_i$ according to [?]: 

$$\pi_i^t(a_i) = \mathrm{Proj}_{\Pi_i}\left[ q_i^t(a_i) + \epsilon_t \nabla_{\pi_i} u_i\left(\pi_i, \theta_i^t(a_{-i})\right) \right], \tag{34}$$

where $\epsilon_t$ is the time-varying step size, $\mathrm{Proj}_{\Pi_i}[\cdot]$ defines the 
projection onto the strategy space $\Pi_i$ of agent i, $\theta_i^t(a_{-i})$ 
is the estimated opponent-action frequency, which can be 
derived following (27), and $q_i^t(a_i)$ is the estimated local-action 
frequency, which can be derived in the same manner as (28): 

$$q_i^t(a_i) = q_i^{t-1}(a_i) + \frac{1}{t}\left(I(a_i^t, a_i) - q_i^{t-1}(a_i)\right), \tag{35}$$

where action $a_i^t$ is generated as a random outcome of the 
evolving strategy. Following (34) and (35), the strategy of 
each agent is a (projected) combination of its own empirical 
action frequency and a gradient step based on the estimated 
opponents’ action frequency. According to [?], [146], GP in 
continuous games is guaranteed to converge within a distance 
of the order of $\epsilon_t$ of the NE of the game, if the NE is a strict one. 
However, GP cannot converge to a completely-mixed NE of 
the game (see Lemma 4.1 of [?]). Due to such a limitation 
on the convergence condition, the basic form of GP in (34) is rarely 
used directly in the solution to networking problems. 

‡ For the definitions of ESS, rest point and supermodular game, please 
refer to [?] for more details. 
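
The projected update of GP can be sketched for a finite game as follows. Two assumptions are made for the illustration and are not taken from the surveyed works: the strategy space is the probability simplex, with the projection computed by the standard sorting-based Euclidean projection, and the gradient of the expected payoff with respect to the agent's own mixed strategy is taken as the vector of expected payoffs of its actions against the estimated opponent frequency. The names `project_to_simplex` and `gradient_play_step` are hypothetical.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]                       # entries in decreasing order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]   # last index kept positive
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)


def gradient_play_step(q_own, theta_opp, payoff_matrix, step):
    """One GP step: own empirical frequency plus a projected gradient step.

    payoff_matrix[a_i, a_j] is the agent's payoff for playing a_i against a_j.
    """
    grad = payoff_matrix @ theta_opp           # expected payoff of each own action
    return project_to_simplex(q_own + step * grad)


payoff = np.array([[2.0, 0.0], [0.0, 1.0]])
strategy = gradient_play_step(np.array([0.5, 0.5]), np.array([1.0, 0.0]), payoff, 0.1)
```

In this small example the opponent is believed to play action 0, so the step shifts probability mass from action 1 to action 0 instead of jumping to the pure best response, which is the “better response” behavior described above.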

As an improvement to the basic form of GP, Derivative-Action 
GP (DAGP) is developed in [?]. By introducing the parameter 
$v_i^t(a_i)$ to approximate the first-order derivative of $q_i$, the 
updating mechanism of DAGP is defined as follows [146]: 

(36) 

PtiqKai) - v\{ai))\^\ 

where $q_i^t$ is updated following (35), $\epsilon_t$ and $\theta_i^t$ are obtained 
in the same way as in (34), and $\mu_t$ is a large factor satisfying 
$\mu_t > 0$. According to [?], [146], for large $\mu_t > 0$, if $\epsilon_t$ satisfies 
certain conditions (see Theorem 4.2 of [?] and Theorem 3.1 
and Theorem 3.3 in [146] for more details), the strategy $\pi_i^t$ is 
asymptotically locally stable and converges to the NE with a 
non-zero probability. 

GP and DAGP not only require the agents to be able to 
track the frequency of both the local actions and the opponent 
actions, but also require that the structure of the local utility 
functions is known to each agent. Compared with FP and SFP, 
the most important feature of the GP-based learning algorithms 
is that the updating mechanism can be easily extended to the 
case of continuous games. In [147], standard GP is applied 
to the continuous, random medium access game, in which 
a set of wireless nodes learn to play the random access 
strategies $p_i$ ($0 < p_i < 1$) after observing the vector of 
channel contention signals $\mathbf{q}$. Instead of directly adapting to 
the contention signal $q_i$, each wireless node introduces a price 
function $C_i(q_i)$ to adjust its local net payoff with the original 
utility function $U_i(p_i)$ as $u_i(p) = U_i(p_i) - p_i C_i(q_i)$. In [147], 
the random access game is proved to have a unique nontrivial 
NE (namely, $\nabla_{p_i} u_i(p_i^*, p_{-i}^*) = 0$ at the NE $(p_i^*, p_{-i}^*)$), and 
the standard GP is proved to converge geometrically to the nontrivial 
NE if a certain condition is satisfied by the step size $\epsilon_t$ 
in (34). The application of standard GP can also be found 
in the power control game of a multi-cell CDMA network 
with dynamic handoffs between cells [148]. After introducing 
a pricing mechanism with the cost function based on the local 
power consumption, the game formulation in [148] adopts 
a payoff function that is twice continuously differentiable, 
non-decreasing and strictly convex. It is proved in [148] that 
standard GP is able to exponentially converge to the smallest 
convex set which contains all the possible NE of the power 
control game, if the spreading factor of the CDMA system 
satisfies a certain condition.§ 

One typical example of applying the DAGP-based learning 
to networking problems can be found in [149], which formulates 
the interference coordination problem in a multi-link 
MIMO system as a noncooperative game. In the game, the 
covariance matrix of the signal of each link is considered as 
the local strategy and is drawn from a common, continuous 
strategy space. The matrix form of DAGP is adopted and 
guaranteed to converge to a unique NE of the game, if the 
covariance matrix of the total interference and noise at the 
receiver of each link satisfies a certain condition. 

3) Learning Automata: As introduced in Section III-A, 
LA is featured by the process of action selection based on 
policy iteration using only local information. For non-game-based 
wireless networking problems, (distributed) LA has 
been shown to be efficient in the scenarios which can be 
formulated as single-state problems controlled by a single 
active decision-making entity at one time instance. Successful 
applications of LA in these scenarios include 
multipath on-demand multicast routing in CRNs 
[150] and multicast routing in mobile ad-hoc networks [151]. 
When it comes to the more complicated framework of network 
control games, most of the LA-based learning schemes are 
employed to obtain NE policies. As a special case of the 
general LA updating rule (?), $L_{R-I}$ learning has been widely 
applied to network control problems due to its simplicity and 
convergence property. By abusing the notations in (?), the 
rules of $L_{R-I}$ learning can be expressed as (38): 

$$\pi_i^{t+1}(a_i) = \begin{cases} \pi_i^t(a_i) + \mu \tilde{r}_i^t \left(1 - \pi_i^t(a_i)\right), & \text{if } a_i = a_i^t, \\ \pi_i^t(a_i) - \mu \tilde{r}_i^t \pi_i^t(a_i), & \text{if } a_i \neq a_i^t, \end{cases} \tag{38}$$

where $\mu$ ($0 < \mu < 1$) is a learning parameter and $\tilde{r}_i^t$ is the 
normalized environment response. The convergence 
property to the NE for the learning mechanism in (38) in a 
general noncooperative game has been proved in [152]: 

Theorem 4. In a repeated game $G = \langle \mathcal{N}, A = \times_n A_n, \{0 \le \tilde{r}_n \le 1\}_{n \in \mathcal{N}} \rangle$ 
with each agent employing $L_{R-I}$ learning, the 
following statements are true if $\mu$ in (38) is sufficiently small: 

• all stationary points that are not NE are unstable, and 

• all strict NE in pure strategies are asymptotically stable. 
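
A single step of the linear reward-inaction rule can be sketched as below. The sketch follows the textbook form of $L_{R-I}$ learning; the function name `lri_update` and the numerical values are chosen for illustration, and the reward is assumed to be already normalized to [0, 1].

```python
import numpy as np


def lri_update(policy, action, reward, mu):
    """One linear reward-inaction step for a single agent.

    policy -- current mixed policy over the local actions
    action -- index of the action played in this round
    reward -- normalized environment response in [0, 1]
    mu     -- learning parameter, 0 < mu < 1
    """
    policy = np.asarray(policy, dtype=float)
    new_policy = policy - mu * reward * policy          # shrink every action
    new_policy[action] = policy[action] + mu * reward * (1.0 - policy[action])
    return new_policy


updated = lri_update([0.5, 0.5], action=0, reward=1.0, mu=0.1)
unchanged = lri_update([0.5, 0.5], action=0, reward=0.0, mu=0.1)
```

With a zero reward the policy is left untouched (the “inaction” part), and with a positive reward the probability of the played action grows while the vector remains a valid probability distribution.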


However, no uniform expression is provided in the literature 
to obtain the normalized environment response function $\tilde{r}_i^t$ in 
(38). For example, in [153], standard $L_{R-I}$ learning is adopted 
to manage the opportunistic spectrum access by N SUs over 
M primary channels with a fixed transmit rate $R_m$ on channel 
m. In this case, the normalized random reward is obtained 
as follows: 

(39) 

n 


§ “Exponential convergence” is used to describe the property of learning 
when asymptotically converging to the convex set. If $\|\pi_i^t - \pi_i^*\| = O(\mu^t)$ 
for some $\mu < 1$, we say that the learning process achieves exponential 
convergence. 


in which the instantaneous reward of each SU is evaluated after 
considering the PU activities and the channel contention with 
its rival nodes. The opportunistic spectrum access game is 
further modeled as an exact potential game. Therefore, at least 
one pure-strategy NE exists for the game [?]. According to 
Theorem 4, $L_{R-I}$ learning ensures the convergence to the 
pure-strategy NE in the opportunistic spectrum access game. 
Apart from [153], the standard $L_{R-I}$ learning scheme is 
a frequent solution whenever the 
convergence property of Theorem 4 is satisfied and the existence 
of a pure-strategy NE can be proved. The applications of 
the standard $L_{R-I}$ learning scheme range from relay selection 
in the cooperative network [154] to the CSMA-based DSA 
management [155] and the MIMO-based DSA management 
[156] in the CRNs. 

In contrast to the aforementioned works, variations of the 
standard $L_{R-I}$ learning mechanism using a different 
strategy-updating rule can also be found in studies such as [157]. 
In [157], a discrete power control problem in a CDMA-like 
cellular network with mutual interference is modeled as a 
repeated noncooperative game. In the power control game, 
each node only knows its local payoff measured as the power 
efficiency. The modified linear-reward-inaction updating rule 
in [157] is defined as follows: 

$$\pi_i^{t+1}(a_i) = \begin{cases} \pi_i^t(a_i) - \mu \tilde{r}_i^t \pi_i^t(a_i), & \text{if } a_i^t \neq a_i, \\ \pi_i^t(a_i) + \mu \tilde{r}_i^t \sum_{a \neq a_i} \pi_i^t(a), & \text{if } a_i^t = a_i. \end{cases} \tag{40}$$

Let $u_i^t$ denote the utility of node i by choosing a discrete power 
level $a_i^t$ for transmission at time t. Then, the normalized utility 
feedback $\tilde{r}_i^t$ is obtained as follows: 

$$\tilde{r}_i^t = \frac{u_i^t - \min\{u_i\}}{\max\{u_i\} - \min\{u_i\}}, \tag{41}$$

where the minimum and the maximum are taken over the utility 
values of node i. 
The major difference between (40) and (38) lies in the way of 
updating the probability of choosing an action when the action 
results in a new reward. Under this learning algorithm, the 
evolution of the power selection becomes a Markov process. 
Following the same approach of proving the convergence property 
based on Ordinary Differential Equation (ODE) analysis 
and Lyapunov’s stability theorem as in [152], it is proved in 
[157] that the LA-based learning scheme in (40) will only 
converge to the mixed-strategy NE of the considered power 
control game if the learning step $\mu$ is sufficiently small. 
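
Since LA-type rules expect a feedback value in the unit interval, the raw utility has to be rescaled before it enters the update. A minimal sketch of such a min-max normalization is given below. The function name `normalized_feedback` is hypothetical, the bounds are assumed to be known (or tracked) by the node, and the clipping of out-of-range samples is an added safeguard that is not taken from the surveyed works.

```python
def normalized_feedback(utility, u_min, u_max):
    """Min-max scaling of a utility sample into [0, 1].

    u_min and u_max are the smallest and largest utility values the node
    expects to observe (assumed known here for simplicity).
    """
    if u_max <= u_min:
        raise ValueError("u_max must be strictly greater than u_min")
    scaled = (utility - u_min) / (u_max - u_min)
    return min(1.0, max(0.0, scaled))    # clip samples outside the bounds


feedback = normalized_feedback(5.0, 0.0, 10.0)
```

The resulting value can be passed directly as the reward of a linear reward-inaction step; how the bounds are obtained in practice depends on the utility function of the specific game.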

Fig. 14. A toy example of power allocation in the multi-user CRN with 
limited ability of acquiring the strategy information from other CRs [?]. 

In addition to $L_{R-I}$ learning, other learning schemes based 
on the general LA updating rule in (?) are also employed for 
resource allocation in the CRNs. In [158], an LA mechanism 
based on the softmax (Logit) function is applied to learn 
the ε-optimal solution to the traffic allocation problem in a 
multi-hop cognitive wireless mesh network. With the proposed 
LA mechanism, the probability for node i to select link k for 
transmitting at the n-th possible rate is determined by the 
softmax function: 

$$\pi_{i,k}^{n}(t) = \frac{\exp\left(\zeta_{i,k}^{n}(t)\right)}{\sum_{j=1}^{N} \exp\left(\zeta_{i,k}^{j}(t)\right)}, \tag{42}$$

where N denotes the number of possible transmit rates and 
the intermediate parameter $\zeta_{i,k}^{n}$ is updated according to the 
following LA rules: 


Cfc(^ + 1) —' 


forn = j; 

<,k (0 + {t), for n j. 


(43) 


In (43), $\alpha_t$ ($0 < \alpha_t < 1$) is the learning rate and the random 
perturbation term is obtained from a set of i.i.d. random variables with zero 
mean. S(t) is the normalized utility feedback that is provided 
by the gateway node. In order to ensure the convergence of 
the learning algorithm in (43), the traffic engineering game is 
modeled as a team game with the identical payoff (hence a 
potential game). Thus the SUs need to share the information 
on the global, normalized utility feedback S(t) for updating 
the value of the intermediate parameter. In [158], the value of S(t) is obtained 
from arbitrarily scaling the sum of the local payoff functions 
down to the range of [0,1]. By allowing information exchange 
and constructing an N-person potential game, it is proved in 
[158] that for sufficiently small values of $\alpha_t$ and of the variance 
of the random perturbation, the LA mechanism in (43) is guaranteed to achieve 
the ε-optimal solution to the traffic engineering problem. 

In [159], Bush-Mosteller LA [?] is adopted for learning 
the NE of the repeated power control game in a CRN with 
a set of power constraints on the aggregated interference 
experienced by each PU (Figure 14). Bush-Mosteller learning, 
also known as the linear reward-penalty LA, can be viewed 
as a general form of $L_{R-I}$ learning [?]. In [159], the 
CRN is assumed to be composed of N SUs and M PUs. 
The wireless channels are assumed to be stationary, and the 
SUs are able to monitor each PU’s feedback indicating the 
sum of interference to each PU receiver. It is also assumed 
that no SU can observe the strategies of the other SUs 
(see Figure 14). Let $U_k(\pi_k, \pi_{-k})$ be the expected utility of 
SU link k and $W_l(\pi_k, \pi_{-k})$ be the corresponding expected 
interference at PU l. The constrained game is transformed 
into an unconstrained game with the help of the Lagrange 
multipliers. The Lagrange function of SU k is defined with a 
regularization term $\frac{\delta}{2}\left(\|\pi_k\|^2 - \|\lambda_k\|^2\right)$ as follows: 


Lfcfc5 Afc) — UizijTk—k/) 1 A/(PU/(TTfc, 7r_fc) 


(44) 


where $\lambda_l$ is the Lagrange multiplier for the constraint from
PU $l$, $\lambda_k$ is the vector of $\lambda_l$ and $\bar{W}_l$ is the maximum level
of the interference to PU $l$. It is shown in [159] that finding
the equilibrium point of the original constrained power control
game is asymptotically equivalent to determining the equilibrium
point of the unconstrained game with the regularized
function given in (44). The following learning scheme, based
on linear reward-penalty LA, is adopted to update the local
policies:


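$$\pi_k^{t+1} = \pi_k^t + \alpha_k^t \left[ e_{N_k}(P_k^t) - \pi_k^t + \beta_k^t \, \frac{e^{N_k} - N_k \, e_{N_k}(P_k^t)}{N_k - 1} \right], \tag{45}$$

where $P_k^t$ is the power level that SU $k$ chooses at iteration $t$.
The vectors $e_{N_k}(P_k^t)$ and $e^{N_k}$ are defined as follows:

$$e_{N_k}(P_k^t) = (0, \ldots, 0, 1, 0, \ldots, 0)^{\top}, \tag{46}$$

$$e^{N_k} = (1, \ldots, 1)^{\top}. \tag{47}$$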


The normalized utility feedback $\beta_k^t$ is obtained based on the
Lagrangian with the expected utility and interference being
replaced by the instantaneous payoff and interference in (44).
With a user-defined normalization procedure, the value of
$\beta_k^t$ is scaled within the interval $[0, 1]$. The time-varying
correction (adaptation) factors $\alpha_k^t$ also belong to the unit
segment. Meanwhile, the Lagrange multiplier is updated as:

= (48) 

i/l = S*Xl — rfi + Cl, (49) 

where $\eta_l^t$ is the instantaneous sum of interference at PU $l$,
$\delta^t$ is the regularization factor in (44), and the operator
applied in (48) is a projection operator. The learning scheme defined
by (45)-(49) ensures the convergence to the NE, provided that the
parameter sequences of the scheme satisfy certain properties (see
Assumptions A1-A3 in [159]), and the power control game is diagonal
concave Eq). Compared with $L_{R-I}$ learning, Bush-Mosteller
LA requires stricter conditions for converging to the NE.
This is a major reason that prevents Bush-Mosteller learning
from being widely applied to wireless resource allocation
problems. Because the game is required to be diagonal
concave, and because the original SINR-based utility does
not naturally possess the property of diagonal concavity, the
authors of [159] use an arbitrarily designed utility function
to replace the real expected mutual-interference-based local
utility in order to derive a proper payoff function for the
constructed power control game.
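The linear reward-penalty update in (45) can be illustrated with a minimal sketch. It assumes the standard Bush-Mosteller form, in which the normalized feedback acts as a penalty weight: a feedback of 0 moves the policy fully toward the chosen power level, and a feedback of 1 moves it away. The function name `bush_mosteller_step` and its arguments are illustrative and are not taken from [159]; the Lagrange-multiplier update is not modeled.

```python
def bush_mosteller_step(policy, chosen, feedback, step):
    """One linear reward-penalty update of a probability vector.

    policy   : list of probabilities over N >= 2 power levels
    chosen   : index of the power level played in this round
    feedback : normalized feedback in [0, 1] (plays the role of beta)
    step     : correction factor in (0, 1] (plays the role of alpha)
    """
    n = len(policy)
    updated = []
    for level, prob in enumerate(policy):
        # Component of the unit vector e(P) that marks the chosen level.
        e_chosen = 1.0 if level == chosen else 0.0
        # Component of (e^N - N e(P)) / (N - 1), built from the all-ones vector.
        spread = (1.0 - n * e_chosen) / (n - 1)
        updated.append(prob + step * (e_chosen - prob + feedback * spread))
    return updated


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]
    print(bush_mosteller_step(uniform, chosen=1, feedback=0.0, step=0.5))
    print(bush_mosteller_step(uniform, chosen=1, feedback=1.0, step=0.5))
```

The two correction terms sum to zero over all components, so the updated vector remains a probability distribution for any feedback in [0, 1].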

4) No-Regret Learning: Usually, the terminology “no-regret
learning” is used to refer to any learning algorithm that
exhibits the property of no-regret when compared with a
set of designated strategies [72], [161]. Formally, for an
infinitely repeated game $G = \langle \mathcal{N}, \mathcal{A} = \times \mathcal{A}_n, \{u_n\}_{n \in \mathcal{N}} \rangle$,
given the adversary (deterministic) strategy $a_{-i}$, the regret of
agent $i$ for playing strategy $\pi_i$ instead of choosing strategy $a_i$
can be defined as the difference in its payoff obtained from
playing these strategies:

$$D_i(\pi_i, a_i) = u_i(a_i, a_{-i}) - u_i(\pi_i, a_{-i}). \tag{50}$$

"For the detailed derivation of fj_, please refer to (31) and (32) in Go) 
Let $\phi(\cdot)$ denote a modification mapping $\pi_i' = \phi(\pi_i)$, where
$\pi_i'(a) = \sum_{b:\phi(b)=a} \pi_i(b)$. Given a sequence of
adversary strategies $\{a_{-i}^t\}$, we can define a general no-regret
learning algorithm (also known as $\Phi$-no-regret learning) for
agent $i$ as follows [161]:


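Definition 6 ($\Phi$-no-regret learning). For a finite subset $\Phi$ of
memoryless mappings $\phi$, a learning algorithm that generates
$\pi_i^t$ is said to exhibit $\Phi$-no-regret if the regret of that learning
algorithm,

$$D_i^t(\phi) = u_i(\phi(\pi_i^t), a_{-i}^t) - u_i(\pi_i^t, a_{-i}^t), \tag{51}$$

satisfies the following condition:

$$\limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} D_i^t(\phi) \le 0, \quad \forall \phi \in \Phi. \tag{52}$$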

There are two well-studied categories of the $\Phi$-no-regret
properties: no-external-regret and no-internal-regret [161]. The
no-external-regret property is to minimize the regret with
respect to any comparison class of algorithms that lead to
deterministic strategies. In other words, for no-external-regret
learning, the mapping $\phi(\cdot)$ satisfies $\phi(\pi_i) = a$ ($a \in \mathcal{A}_i$). The
no-internal-regret property is also known as no-swap-regret
since the mapping associated with internal regret swaps the current
online strategies as follows:

$$\phi_{a \to b}(\pi_i)(c) = \begin{cases} \pi_i(c), & \text{if } c \ne a, b, \\ 0, & \text{if } c = a, \\ \pi_i(a) + \pi_i(b), & \text{if } c = b. \end{cases} \tag{53}$$
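As a small illustration of the swap mapping in (53), the sketch below moves the probability mass of one action onto another and leaves all other actions untouched. The function name `swap_mapping` and the channel labels are hypothetical.

```python
def swap_mapping(policy, a, b):
    """Return a copy of `policy` in which the mass of action `a` is moved to `b`.

    policy : dict mapping each action to its probability
    """
    swapped = dict(policy)
    if a != b:
        swapped[b] = policy[b] + policy[a]  # case c = b
        swapped[a] = 0.0                    # case c = a
    return swapped                          # all other actions are unchanged


if __name__ == "__main__":
    policy = {"ch1": 0.5, "ch2": 0.3, "ch3": 0.2}
    print(swap_mapping(policy, "ch1", "ch3"))
```

An internal-regret learner compares its realized payoff against every such pairwise swap, whereas an external-regret learner compares only against constant actions.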


One well-known example for applying no-external-regret
learning to wireless networking problems is [162], which
uses the random weighted majority (i.e., Hedge) algorithm
ESI for learning the NE strategies in a channel allocation
game in a CRN. With a careful utility design, the channel-allocation
game is proved to be an exact potential game. Let
$u_i^t(a_i)$ denote the cumulated instantaneous payoff received by
SU $i$ given the sequence of adversary strategies $\{a_{-i}^t\}$. The
mixed policy of SU $i$ is updated as follows:

$$\pi_i^{t+1}(a_i) = \frac{(1+\mu)^{u_i^t(a_i)}}{\sum_{a_i' \in \mathcal{A}_i} (1+\mu)^{u_i^t(a_i')}}, \tag{54}$$

where $\mu > 0$. It is well-known that the learning scheme in
(54) has a regret bound of $D_i < \mu/2$ llTSl. Compared with
the widely applied best-response-based learning schemes for
potential games, which also ensure the convergence to the NE,
the random weighted majority algorithm (54) does not need
any information sharing between SUs.
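The update in (54) can be sketched in a few lines. The sketch assumes the full-information setting, in which the SU can evaluate the payoff each of its own actions would have earned in every round; the function name `hedge_policy` and the example payoffs are illustrative and are not taken from [162].

```python
def hedge_policy(cumulative_payoff, mu):
    """Mixed policy proportional to (1 + mu) ** cumulative payoff.

    cumulative_payoff : dict mapping each action to its accumulated payoff
    mu                : positive learning parameter
    """
    # Subtracting the maximum leaves the normalized ratio unchanged
    # and avoids overflow for large accumulated payoffs.
    top = max(cumulative_payoff.values())
    weights = {a: (1.0 + mu) ** (u - top) for a, u in cumulative_payoff.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


if __name__ == "__main__":
    payoff_so_far = {"ch1": 0.0, "ch2": 0.0}
    for round_payoff in ({"ch1": 1.0, "ch2": 0.2}, {"ch1": 0.8, "ch2": 0.1}):
        for action, value in round_payoff.items():
            payoff_so_far[action] += value
        print(hedge_policy(payoff_so_far, mu=0.5))
```

Only the SU's own accumulated payoffs enter the update, which is why no action or utility exchange between SUs is needed.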

The construction of a no-external-regret learning mechanism
can be further illustrated by the example of [164], where the
problem of collaborative sensing with malicious nodes in an
$N$-channel CRN is studied. In the considered CRN, SU $j$ is
supposed to collaborate with a set of its neighbor SUs $\mathcal{N}_j$ and
to choose whether to aggregate one of their sensing reports
into its local channel-state prediction. At time $t$, a mixed policy
$\pi_j^t$ over the candidate reports is adopted to choose the reports from
the SUs in $\mathcal{N}_j$. With the goal of minimizing the long-term
expected loss due to false decisions by choosing the sequence
$\{\pi_j^t\}$ instead of the pure-strategy best response (internal regret),
we have

T 

T? yf Gfifj, (55) 

{■"■( }t=i t=i 


where P{j',s*) is the instantaneous loss due to adopting the 
report by SU j', and T {Tr{, s‘) is the average loss with policy 

nj at channel state f (tt^.s*) = E/eAGuL} 

In [164], such a decision process is modeled as a two-player
constant-sum game. In the game, SU $j$ plays against nature,
which plays as an adversary player and chooses state $s$ aiming
at causing the worst cost to SU $j$. The strategy-updating
mechanism is designed upon the softmax function (42) with
the accumulated instantaneous loss being the
argument of the exponential function $\exp(\cdot)$. It is shown in
[164] that no-regret learning based on the softmax function
converges to the NE, which is equivalent to the minimax value
of the game.

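Another category of no-regret learning algorithms that are
widely applied in the context of network control aims at
minimizing the internal regret and learning the CE in repeated
games [72]. For a general repeated game
$G = \langle \mathcal{N}, \mathcal{A} = \times \mathcal{A}_n, \{u_n\}_{n \in \mathcal{N}} \rangle$, the estimated average loss
for agent $i$ to play action $a_i^t$ instead of playing $a'$ at time $t$ is given by:

$$D_i^t(a_i^t, a') = \frac{1}{t} \sum_{\tau \le t:\, a_i^\tau = a_i^t} \left( u_i(a', a_{-i}^\tau) - u_i(a^\tau) \right). \tag{56}$$

Based on (56), the regret of agent $i$ for not playing $a'$ is

$$R_i^t(a_i^t, a') = \max\{ D_i^t(a_i^t, a'), 0 \}. \tag{57}$$

With (57), the mixed policy of agent $i$ is updated by

$$\pi_i^{t+1}(a) = \begin{cases} \frac{1}{\mu} R_i^t(a_i^t, a), & \forall a \ne a_i^t, \\ 1 - \sum_{a' \ne a_i^t} \pi_i^{t+1}(a'), & a = a_i^t, \end{cases} \tag{58}$$

where $\mu$ is a sufficiently large constant to ensure that $\pi_i$ ($i \in \mathcal{N}$)
is a well-defined probability.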

Like the random weighted majority algorithm, the learning
scheme defined by (56)-(58) to learn the CE does not need
the agents to exchange the action/utility information. The
no-internal-regret learning scheme ensures the asymptotic
convergence to the set of the CE, according to Theorem 5 [72]:

Theorem 5. If every agent plays according to the learning
scheme defined by (56)-(58), the empirical distribution of the
joint action selection:

$$z_T(a) = \frac{1}{T} \left| \{ t \le T : a^t = a \} \right| \tag{59}$$

converges almost surely to the set of CE of the game $G$ as
$T \to \infty$.
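The scheme in (56)-(58) and the empirical distribution in (59) can be sketched for a two-player stage game. The payoff table below is a hypothetical "chicken"-style game chosen only for illustration, and the names `regret_matching_policy` and `simulate` are not taken from the cited works. The constant `mu` is assumed to be large enough to keep every probability in [0, 1].

```python
import random

# Hypothetical symmetric stage game: PAYOFF[(own, opponent)] is the own payoff.
PAYOFF = {("C", "C"): 6.0, ("C", "D"): 2.0, ("D", "C"): 7.0, ("D", "D"): 0.0}
ACTIONS = ["C", "D"]


def chicken_payoff(own, opponent):
    return PAYOFF[(own, opponent)]


def regret_matching_policy(history, payoff, actions, last_action, mu):
    """Next mixed policy of one agent from its own play history.

    history : list of (own_action, opponent_action) pairs, one per past round
    """
    t = len(history)
    policy = {}
    for alt in actions:
        if alt == last_action:
            continue
        # Average gain from replacing `last_action` by `alt` in past rounds.
        gain = sum(payoff(alt, opp) - payoff(own, opp)
                   for own, opp in history if own == last_action) / t
        policy[alt] = max(gain, 0.0) / mu          # positive part, scaled by 1/mu
    policy[last_action] = 1.0 - sum(policy.values())  # remaining mass stays put
    return policy


def simulate(rounds, mu=4.0, seed=0):
    """Empirical distribution of joint actions after `rounds` rounds of play."""
    rng = random.Random(seed)
    last = [rng.choice(ACTIONS), rng.choice(ACTIONS)]
    histories = [[], []]
    counts = {}
    for _ in range(rounds):
        joint = (last[0], last[1])
        counts[joint] = counts.get(joint, 0) + 1
        histories[0].append((last[0], last[1]))
        histories[1].append((last[1], last[0]))
        nxt = []
        for agent in (0, 1):
            policy = regret_matching_policy(
                histories[agent], chicken_payoff, ACTIONS, last[agent], mu)
            nxt.append(rng.choices(ACTIONS, weights=[policy[a] for a in ACTIONS])[0])
        last = nxt
    return {joint: count / rounds for joint, count in counts.items()}


if __name__ == "__main__":
    print(simulate(500))
```

Each agent uses only its own payoff function and the observed history of play, which matches the remark that no action/utility exchange is required.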


The applications of the learning scheme given by (56)-(58)
to network control problems can be found in [165]-[168].
As one of the earliest works that employ no-regret learning
in the network control problem, [165] aims at obtaining the
CE in a dynamic spectrum access game with an overlay CR
network. No-regret learning is used for the SUs
to address the problem of channel contention. It is shown
that the performance at the CE obtained through learning is
almost as good as the optimal equilibrium in the set of CE. In
[166], a joint power-channel selection problem is studied in
an underlay CRN with a free band and a set of price-charging
PU channels. The no-regret learning algorithm (56)-(58) is
combined with an auction game, which considers the SINR
to the PU or the allocation power as an item for auction. The
joint power-channel selection game is played in two levels.
In the lower-level subgame, the SUs perform the SINR/power
bidding game with a fixed set of PU-channel selection. In the
higher-level subgame, the SUs adopt the no-regret learning
algorithm (56)-(58) to obtain the CE in the channel-selection
game. In [167], the learning scheme of (56)-(58) is adopted to
obtain the CE strategies in a spectrum sensing game among
heterogeneous SUs in an overlay CRN. In the game, each
SU chooses either to cooperatively sense the PU channel that
it is assigned to with some power consumption (i.e., with
some cost), or to directly access the channel as a free rider
(i.e., without any cost) based on the sensing reports by the
neighbor SUs. With the proposed no-regret learning scheme,
the strategies are obtained based on minimizing the total regret
of the neighborhood set of an SU rather than the individual
regret. It is shown in [167] that the learning scheme with the
neighborhood regret can significantly outperform the learning
algorithm based on the local regret. This is also considered
as the main reason that motivates local SUs to share their
local action and payoff information for neighborhood learning.
In [168], the scheme in (56)-(58) is applied to learn the CE
of the subcarrier allocation strategies in a multi-cell OFDMA
network. Again, each link in the subcarrier allocation game
does not need to know the private strategies and utilities of
the other links.

^^The definition of nature in an extensive form game can be found in na


The no-internal-regret learning scheme (56)-(58) only requires
that the structure of the local payoff function is known
to each agent. Compared with the NE-driven learning methods
such as FP and best-response learning, no-internal-regret
learning could achieve a better social performance (i.e., in
terms of the sum of the players' rewards). Since the set of CE
is a convex polytope with all the NE lying on one of its
sections [169], it is possible for the no-internal-regret learning
algorithm to reach a CE that is not in the polygon of the NE,
thus resulting in a better performance than any NE. Although
the learning rule of (56)-(58) does not guarantee convergence
to the socially optimal CE, a number of empirical studies (e.g.,
no-regret learning in the cognitive congestion control games 
ina, oni) show that the no-regret learning scheme can 
significantly outperform best-response learning and FP [170].
Moreover, its convergent strategy can be considered as a good
approximation of the global optimal solution [171]. As a result,
many studies consider the no-internal-regret learning scheme
as an approach to implicitly enforce cooperation within the
framework of general-sum noncooperative games.


B. Applications of Learning in the Context of Stochastic
Games (SGs)

SGs generalize both the repeated games and the MDPs
by allowing the payoff of the players at each round of the
game to depend on a state variable, whose evolution
is influenced by the joint actions of the players. Compared
with the models based on repeated games, SGs are considered
a more practical tool for modeling the agent interaction in a
stochastic wireless environment, especially when the elements
of the wireless environment (e.g., channel states, buffer states
and collision states) evolve stochastically and are influenced
by the transmission strategies of the wireless agents. In the
context of SGs, model-free learning schemes refer
to the value/policy-iteration algorithms (e.g., the algorithms
summarized in [172]) that do not require any a-priori knowledge
about the state transition of the wireless system. We
note that such a property makes model-free learning especially
appropriate for finding the equilibria of the
SGs in the context of wireless networks. This is because in
most practical scenarios it is difficult to obtain all
the details of the system dynamics due to the complexity
of the network. In what follows, we organize our survey
on learning in SGs according to the approaches used for
experience updating (i.e., value-iteration-based learning vs.
non-value-iteration-based learning).

1) Value-Iteration-Based Learning: In contrast to those
model-based solutions which use linear programming to obtain
the NE (see the example of a constrained power control
SG [173]), value-iteration-based learning algorithms generally
need to construct a series of intermediate “matrix games”
from the original SGs. Consider a general discounted-reward
SG, $G = \langle \mathcal{N}, \mathcal{S}, \mathcal{A}, \{u_n\}_{n \in \mathcal{N}}, \Pr(s'|s,a) \rangle$. A matrix game is
defined based on the current estimation of the state value of
the SG, which is derived in a similar way as 

Definition 7 (Matrix game lICTl ). An n-player matrix game 
(also known as stage game) in an SG is defined as a tuple 
G(s) = in which 

(1 < * < lA/”!) is given by: 

Q^.z(s,a) = u(s,a)-f/3 ^ Pr(s'|s,7r)U^’^i(s'|s, tt,,(60) 

We note that in dhOll, U^j(s'|s, TTi, 7r_i) = -(s, a)}. 

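Under policy $\pi$, transition probability $\Pr(s'|s,\pi)$ can be
expressed as follows:

$$\Pr(s'|s,\pi) = \sum_{a_1 \in \mathcal{A}_1} \cdots \sum_{a_{|\mathcal{N}|} \in \mathcal{A}_{|\mathcal{N}|}} \Big( \Pr(s'|s,a_1,\ldots,a_{|\mathcal{N}|}) \times \pi_1(s,a_1) \cdots \pi_{|\mathcal{N}|}(s,a_{|\mathcal{N}|}) \Big). \tag{61}$$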
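The marginalization in (61) can be written out directly by enumerating the joint actions. The sketch below assumes a tabular transition kernel and independent per-agent policies; the data layout (`kernel`, `policies`) and the two-SU example are hypothetical.

```python
from itertools import product


def transition_under_policy(kernel, policies, state, next_state):
    """Probability of moving from `state` to `next_state` under a joint policy.

    kernel   : dict mapping (state, joint_action) to a dict {next_state: prob}
    policies : list with one entry per agent; policies[i][state][action] is
               the probability that agent i plays `action` in `state`
    """
    total = 0.0
    action_sets = [list(p[state].keys()) for p in policies]
    for joint in product(*action_sets):
        # Probability of this joint action under independent mixed policies.
        weight = 1.0
        for agent, action in enumerate(joint):
            weight *= policies[agent][state][action]
        total += kernel[(state, joint)].get(next_state, 0.0) * weight
    return total


if __name__ == "__main__":
    # Two SUs; the channel becomes "busy" for sure only if both transmit.
    kernel = {}
    for joint in product(["tx", "idle"], repeat=2):
        busy = 1.0 if joint == ("tx", "tx") else 0.2
        kernel[("free", joint)] = {"busy": busy, "free": 1.0 - busy}
    half = {"free": {"tx": 0.5, "idle": 0.5}}
    print(transition_under_policy(kernel, [half, half], "free", "busy"))
```

The number of joint actions grows exponentially with the number of agents, so this enumeration is only practical for small games.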

According to Definition 7, a general form of strategy searching
based on value iteration can be implemented as in Algorithm 1
[172]. In (62) of Algorithm 1, operator $\mathrm{Eval}_\pi(\cdot)$ computes
(estimates) the expected payoff in the NE of the matrix game.
The equivalence between the NE of the matrix game and the
NE of the discounted SG is given by Theorem 6.
Algorithm 1 Value-iteration-based learning algorithm.
Require: Initialize $V_{\beta,i}$, $\forall 1 \le i \le |\mathcal{N}|$ arbitrarily.

while convergence criterion is not met do

(a) For state $s$ at round $t$, update the estimated value of
$Q_{\beta,i}^t(s,a)$ of the matrix game.

(b) For state $s$, update the expected state value of
$V_{\beta,i}(s)$ after computing the (mixed) equilibrium strategy
$(\pi_i(s), \pi_{-i}(s))$:

$$V_{\beta,i}^{t+1}(s) \leftarrow \mathrm{Eval}_\pi(Q_{\beta,i}^t(s,a)). \tag{62}$$

end while

Theorem 6 ([67]). The following are equivalent:

• $\pi^*$ is an equilibrium point in the discounted SG, $G$, with
equilibrium payoffs $(V_{\beta,1}(\pi^*), \ldots, V_{\beta,|\mathcal{N}|}(\pi^*))$.

• For each $s \in \mathcal{S}$, strategy $\pi^*(s)$ constitutes an equilibrium
point in static matrix game $G(s)$ with equilibrium payoffs
$(\mathrm{Eval}_\pi(Q_{\beta,1}(s,a)), \ldots, \mathrm{Eval}_\pi(Q_{\beta,|\mathcal{N}|}(s,a)))$. The
value of $Q_{\beta,i}(s,a)$ is given by Definition 7.

According to Theorem 6, Algorithm 1 can be considered
a combination of a matrix-game solver and a value-iteration-based
state value learner. It works as the general form of a
set of model-free strategy-learning algorithms, which differ
from each other only in the way of defining operator $\mathrm{Eval}_\pi(\cdot)$.
In lIMl, operator $\mathrm{Eval}_\pi(\cdot)$ in value iteration is implemented
by a minimax optimization process, and the Q-value of each
learning agent is updated through a standard single-agent
Q-learning process. Such a learning scheme is known as
minimax-Q learning. Specifically, the learning mechanism can
be expressed by

$$Q_{\beta,i}^{t+1}(s_t, a_i^t, a_{-i}^t) \leftarrow (1-\alpha_t)\, Q_{\beta,i}^t(s_t, a_i^t, a_{-i}^t) + \alpha_t \left( u_i(s_t, a_i^t, a_{-i}^t) + \beta V_{\beta,i}^t(s_{t+1}) \right), \tag{63}$$

$$V_{\beta,i}(s_t) = \max_{\pi(s_t, a_i)} \min_{a_{-i}} \sum_{a_i} Q_{\beta,i}(s_t, a_i, a_{-i})\, \pi(s_t, a_i), \tag{64}$$

$$\pi_i(s, a_i) = \arg\max_{\pi(s, a_i)} \min_{a_{-i}} \sum_{a_i} Q_{\beta,i}(s, a_i, a_{-i})\, \pi(s, a_i). \tag{65}$$

The solution to (65) is usually obtained through linear
programming, which requires that the matrix game of the
SG is of complete information. It is worth noting that (64)
is an approximation of the exact state value,
$V_{\beta,i}(s_t) = \max_{\pi(s_t, a_i)} \sum_{a \in \mathcal{A}} Q_{\beta,i}(s_t, a)\, \pi(s_t, a)$, which
cannot be obtained directly since the local strategies are
usually private information. Due to the approximation, the
updating mechanism in (63)-(65), although proved to be
effective by empirical studies 1^, does not provide a strict
condition for convergence to the NE.
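One minimax-Q iteration can be sketched as follows. The text refers to linear programming for the stage-game solution; to stay within the Python standard library, this sketch swaps in a multiplicative-weights iteration that only approximates the maximin strategy, which is an assumption of this example rather than the method of the cited works. The names `solve_maximin` and `minimax_q_update`, the table layout and the parameter values are illustrative.

```python
import math


def solve_maximin(payoff, rounds=4000, eta=0.05):
    """Approximate the row player's maximin strategy of a zero-sum matrix game.

    payoff[i][j] is the row player's payoff for row action i and column action j.
    Returns (policy, value), where value is the payoff guaranteed by policy.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    log_w = [0.0] * n_rows
    avg = [0.0] * n_rows
    for _ in range(rounds):
        top = max(log_w)
        w = [math.exp(x - top) for x in log_w]
        total = sum(w)
        p = [x / total for x in w]
        # The opponent replies with the column that hurts the current policy most.
        col = min(range(n_cols),
                  key=lambda j: sum(p[i] * payoff[i][j] for i in range(n_rows)))
        for i in range(n_rows):
            avg[i] += p[i] / rounds
            log_w[i] += eta * payoff[i][col]
    value = min(sum(avg[i] * payoff[i][j] for i in range(n_rows))
                for j in range(n_cols))
    return avg, value


def minimax_q_update(q, v, state, a_own, a_opp, reward, next_state, alpha, beta):
    """Apply the Q-value update, then refresh the state value and local policy.

    q[state][a_own][a_opp] holds Q-values; v[state] holds state values.
    """
    old = q[state][a_own][a_opp]
    q[state][a_own][a_opp] = (1 - alpha) * old + alpha * (reward + beta * v[next_state])
    policy, value = solve_maximin(q[state])
    v[state] = value
    return policy


if __name__ == "__main__":
    policy, value = solve_maximin([[1.0, -1.0], [-1.0, 1.0]])
    print(policy, value)
```

The averaged policy returned by `solve_maximin` guarantees a payoff within a small, step-size-dependent gap of the true game value; an exact LP solver would remove that gap.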

Minimax-Q learning is usually adopted to solve the problems
which can be described as a constant-sum (also known
as strictly competitive) game. One typical category of its
applications in wireless networks is strategy-learning in attack-defense
problems, since such problems can usually be modeled
as a two-player, zero-sum game with the group of normal
nodes and the group of malicious nodes treated as two super
players. In [174], a two-player zero-sum SG is adopted to
model the anti-jamming process of a group of SUs in the
Fig. 15. A snapshot of the anti-jamming defense process in a multi-channel
CRN (adapted from [174]).

CRN (Figure 15). Due to the random activities of the PUs, the
channel-availability states viewed by the SUs are modeled as
a group of independent, two-state Markov chains. In addition,
for each channel, the channel quality measured by the local
SNR is modeled as a finite-state Markov chain. In [174],
the devices in the CRN are divided into two groups: the
normal SUs and jamming nodes. Both the normal SUs and
the attackers access the PU channels in a slotted manner. At
each time slot, the normal SUs will select a subset of channels
for transmission while the attackers will select a subset of
channels for jamming. The group of channels that are selected
for transmission are further subdivided into control channels
and data channels. For a normal SU, the non-zero gain of a
channel can only be achieved when the channel is used for
data transmission and at least one control channel selected
by the normal SU is not jammed by the attackers. The goal
of the normal SUs is to maximize the local channel utility.
Based on the formulation of the two-player zero-sum SG,
the standard minimax-Q-learning algorithm is applied for the
normal SUs to find the equilibrium strategies in the stochastic
attack-defense game. Convergence of the learning algorithm
has been shown by empirical studies. Also, the numerical
simulations show that minimax-Q learning outperforms both
the myopic strategy, which does not consider the future payoff,
and the fixed strategy, which uniformly selects the channels
regardless of the attacker's strategy.

The application of minimax-Q learning in a similar scenario
can be found in [175], which formulates the competition
for open access spectrum in a tactical wireless network as
a competitive mobile network game. The study in [175]
extends the attack-defense model in [174] by dividing the
competitive mobile network into two sub-networks: the ally
network and the enemy network. Each network is composed
of both communicating nodes and jamming nodes. The goal
of the two networks is to achieve the maximum spectrum
utility while jamming the opponent transmission as much as
possible. The channel-availability state is jointly determined
by the transmission-jamming actions of the two networks as a
controlled Markov chain. Channel access in the competitive
network is modeled as a two-player, zero-sum game, and
standard minimax-Q learning is adopted for both the ally and
the enemy network to learn their equilibrium strategies. Apart
from [175], other applications of minimax-Q learning can be
found in [176], [177], which basically adopt the same framework
of the two-player, zero-sum SG as in [174], [175] to
obtain the anti-jamming scheme. In [176], minimax-Q learning
in the SG is employed in a typical DSA network without
considering the impact of jamming the control channels. In
[177], the two-player SG model is extended to the scenarios of
stochastic routing in a MANET, and the attack-proof strategy
is obtained through minimax-Q learning.

For networking problems that need to be described as an
n-player general-sum SG, a more general learning scheme can be
implemented by replacing the minimax operator for $\mathrm{Eval}_\pi(\cdot)$
with the operator that leads to the payoff of the NE in the
general game. For the discounted-reward general-sum SGs,
such a learning scheme is known as Nash Q-learning [66].
Nash Q-learning adopts the same Q-value updating scheme
(63) as in the minimax-Q learning algorithm, and requires that
the value of $V_{\beta,i}(s_t)$ is obtained based on the matrix game NE
of the SG. According to Theorem 6, as long as the NE of each
matrix game obtained from the SG in stage $s_t$ is used in (63) to
compute the value of $V_{\beta,i}(s_t)$, the learning process converges
to the NE of the SG. For Nash Q-learning, operator $\mathrm{Eval}_\pi(\cdot)$
can be expressed by:

$$V_{\beta,i}(s) = \sum_{a_1 \in \mathcal{A}_1} \cdots \sum_{a_{|\mathcal{N}|} \in \mathcal{A}_{|\mathcal{N}|}} Q_{\beta,i}(s, a) \prod_{j=1}^{|\mathcal{N}|} \pi_j^*(s, a_j). \tag{66}$$

In (66), $\pi^*(s)$ is the NE strategy of the matrix game at stage
$t$ when the payoff matrix of agent $i$ is $Q_{\beta,i}(s, a)$.

Theorem 6 also holds when the SG is based on average
reward. The counterpart to Nash Q-learning in an average-reward
SG is known as Nash R-learning [67]. Nash R-learning
adopts the R-learning-based scheme for state-action updating 
as in dSll and (0, which can be summarized by the following 
equations: 

at{ui{st,Sit) + Vl (st+i) - i?* (st, aQ - /i* (st, a*)), 

( 68 ) 

where l^‘(s) is the equilibrium payoff of the stage game and 
is computed following (l66l l. 

When the goal of the learning process is to find the CE
of the discounted-reward SG instead of the NE, Correlated-Q
(CE-Q) learning can be implemented based on the updating
mechanism in (63)-(65) with the state value $V_{\beta,i}(s_t)$ estimated
at the CE strategies [68]. The equivalence between the CE of
the original SG and the CE of the matrix game in each state
still holds. Based on Definition 3 and Theorem 6, we have

Theorem 7 (CE in the SG IMl). For a discounted-reward
SG $G$, a stationary policy $\pi$ is a correlated equilibrium if
$\forall i \in \mathcal{N}$, $\forall s \in \mathcal{S}$, $\forall a_i \in \mathcal{A}_i$ with $\pi_i(a_i) > 0$, for all $a_i' \in \mathcal{A}_i(s)$,

$$\sum_{a_{-i} \in \mathcal{A}_{-i}} \pi(s, a_{-i}, a_i)\, Q_{\beta,i}(s, (a_{-i}, a_i)) \ge \sum_{a_{-i} \in \mathcal{A}_{-i}} \pi(s, a_{-i}, a_i)\, Q_{\beta,i}(s, (a_{-i}, a_i')), \tag{69}$$

which defines the CE of the matrix game in $s$ as $\pi(s)$.
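The inequality in Theorem 7 can be checked numerically for a single state. The sketch below is a hypothetical helper: it takes a joint distribution over action profiles and one Q-table per agent, and tests every recommended action against every possible deviation. The example game is a "chicken"-style matrix chosen for illustration only.

```python
def is_correlated_equilibrium(joint_policy, q_tables, action_sets, tol=1e-9):
    """Check the CE inequalities of one stage game.

    joint_policy : dict mapping each joint action (a tuple) to its probability
    q_tables     : q_tables[i][joint_action] is agent i's stage payoff
    action_sets  : action_sets[i] lists the actions available to agent i
    """
    for agent, actions in enumerate(action_sets):
        for recommended in actions:
            for deviation in actions:
                if deviation == recommended:
                    continue
                gain = 0.0
                for joint, prob in joint_policy.items():
                    if joint[agent] != recommended:
                        continue
                    deviated = joint[:agent] + (deviation,) + joint[agent + 1:]
                    gain += prob * (q_tables[agent][deviated] - q_tables[agent][joint])
                if gain > tol:  # deviating from the recommendation would pay off
                    return False
    return True


if __name__ == "__main__":
    q_row = {("C", "C"): 6.0, ("C", "D"): 2.0, ("D", "C"): 7.0, ("D", "D"): 0.0}
    q_col = {("C", "C"): 6.0, ("C", "D"): 7.0, ("D", "C"): 2.0, ("D", "D"): 0.0}
    third = 1.0 / 3.0
    candidate = {("C", "C"): third, ("C", "D"): third, ("D", "C"): third}
    print(is_correlated_equilibrium(candidate, [q_row, q_col], [["C", "D"]] * 2))
```

Computing such a joint distribution in the first place requires every agent's Q-table, which is the complete-information requirement discussed next in the text.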

For both the NE-based Q-learning (Nash-Q and Nash-R) and
the CE-based Q-learning (CE-Q), it is not specified how the
equilibrium strategies $\pi^*(s)$ for each matrix game are obtained
during the learning process. Since it is necessary for the game
to be of complete information in order to immediately obtain
the NE/CE of the matrix game, the learning agents are required
to keep track of the entire Q-table of all the
other agents at state $s$ in order to compute the exact stage-game
equilibrium. In practice, exchanging such information
will result in a large transmission overhead, which is usually
unaffordable in a wireless network. As a result, most of the
existing studies apply heuristic methods to approximate the
matrix game equilibrium. One example of payoff approximation
at the NE of the matrix game can be found in [178],
which decouples the wireless network into a group of Service
Providers (SPs) and a single entity called the Network Operator
(NO) for network virtualization. Each SP is responsible for
reallocating the available spectrum resources to a group of
end users, while the NO is responsible for allocating the
time-varying spectrum resources to the SPs. Here, resource
allocation through the interface between the NO and the SPs
at each time slot is treated as an auction game with the NO
acting as the auctioneer and the SPs acting as the bidders.
The auction is performed following the Vickrey-Clarke-Groves
(VCG) mechanism iflSl. The entire auction process in the
stochastic environment is modeled as a discounted general-sum
SG, in which the channel state and the traffic state are
assumed to be Markovian and the SP action is the selection of
value functions through choosing the transmit rate. In [178],
the matrix games of the original SG are referred to as the
“current games”. Also, to avoid directly computing the value
of $V_{\beta,i}(s)$ in (66), a conjecture price which approximates the
unit-rate price (strategy) of the NO in the future is introduced.
A Q-value updating scheme which is analogous to the SAS-based
Q-learning scheme is proposed, and the value of the
conjecture price is updated using the subgradient method.

For networking problems which do not possess the
single-server-distributed-agents property of stochastic auction games,
the equilibrium strategies can be learned by implementing an
appropriate amount of local information exchange. In [179],
the problem of traffic offloading in a stochastic heterogeneous
cellular network is first formulated as a centralized discrete-time
MDP and then as an SG. In the SG, a group of macrocell
BSs try to offload their downlink traffic to their corresponding
group of small-cell BSs, which operate in the open access
mode and share the same band with the macro BSs. Before
the learning mechanism is implemented, the authors in [179]
employ a standard state abstraction procedure based on linear
state-value combination (see our discussion in Section III-A).
The Q-values (i.e., the payoff of matrix games) are updated
with the gradient-ascent method based on the gradient of
the new Q-values after state abstraction. The matrix game in
a given state $s$ is modeled as a “virtual game” with common
payoff by allowing the macro BSs to share their instantaneous
spectrum utility with each other. Also, the action of each BS
is updated using $\epsilon$-exploration instead of directly computing
the mixed strategy of the matrix game. It is proved in [179]
that convergence (which may not be to the NE) is guaranteed
with probability one.

A different approach to approximate the matrix game equilibrium
with only local information in the SG can be found
in [180], [181], which employ the learning methods for
repeated games to learn the matrix game equilibrium strategies
and then use these intermediate strategies to approximate
the state value $V_{\beta,i}(s)$ of the original SG. In [180], the
interference mitigation problem with a finite action set of
discrete powers for both the PUs and the SUs in a CRN is
modeled as a discounted-reward SG. In [181], the cross-layer
resource allocation problem for layered video transmission
in a CRN is modeled as a discounted-reward SG. In both
works, the goal of strategy learning is to find the CE of the
respective SG. Both works treat the matrix game at state $s$
as a repeated game and adopt the no-internal-regret learning
method defined by (56)-(58) to approximate the CE strategy
$\pi^*(s)$ at state $s$. Let $\bar{\pi}_i(s)$ denote the intermediate strategy
that is obtained with (58). Since with the no-internal-regret
learning scheme, no action/payoff information exchange is
needed, the strategy estimation in the SG is solely based on
local information. The same method as in (63) is adopted for
Q-value updating, for which state value $V_{\beta,i}(s)$ under this
strategy can be estimated as the expected payoff of the matrix
game:

aiGAi 

To further reduce the information-exchange overhead, the
values of $\pi^*(s_t, a)$ and $Q_{\beta,i}(s_t, a)$ can be replaced by the
conditional local strategy (given the adversary actions) and the
Q-table based on the local state-action pairs [181], respectively.
Such a two-fold, approximate learning scheme does not require
information exchange between wireless devices. However,
compared with the original learning scheme in Algorithm 1,
such a learning algorithm may suffer from using
non-CE policies in the matrix game and from the inaccurate
estimation of $V_{\beta,i}(s_t)$. Although empirical studies show that
convergence can be achieved by the two-fold learning scheme,
no theoretical support is available to guarantee the convergence
to the CE.

2) Conjecture-Based Learning: Considering the problem of
unguaranteed convergence due to the inaccurate estimation of
the equilibrium strategies in the matrix games with two-fold
learning, the concept of “conjecture” [37] about one player's
opponent policies is introduced in several recent studies
[182]-[184]. In an SG, the conjecture of agent $i$ can be defined as
any belief function $c_i : \mathcal{S} \times \mathcal{A}_i \to \mathcal{C}$, in which $\mathcal{C}$ is the space
of agent $i$'s conjectures (e.g., about the opponents' policies
and states). In the case of policy conjecture, we can define
$c_i^t(s, a_{-i})$ as the conjecture of opponent policy $\pi_{-i}(s)$ by
agent $i$ at time $t$. With only local information, the most widely
accepted conjecture updating mechanism is

$$c_i^{t+1}(s, a_{-i}) = c_i^t(s, a_{-i}) + w_i \big( \bar{\pi}_i^t(s, a_i) - \pi_i^t(s, a_i) \big), \tag{71}$$

where $\bar{\pi}_i^t(s, a_i)$ is the so-called reference point and is assumed
to be common knowledge to all the players. With (71), the


Fig. 16. Structure of underlay CR mesh network (adapted from EH). 

conjecture is used by local agent $i$ to maximize its individual
payoff without knowing the strategies or the payoff functions of
the other players. (71) is obtained based upon the assumption
that the other players will be able to observe player $i$'s deviation
from the reference point $\bar{\pi}_i^t(s, a_i)$, and in response to such a
deviation, they will deviate from their own reference point by a
quantity that is proportional to this deviation [37]. With
conjecture $c_i(s, a_{-i})$, the conjecture equilibrium can be defined
as follows (extended from the definition in [182]):

Definition 8 (Conjecture equilibrium). In the stochastic game 
G, a configuration of conjectures c and a joint policy tt* 
constitute a conjecture equilibrium if'iicM 

c*(s,<)=c,(s,^*), (72) 

TT* = argmaxQi(s,7ri,c-(s,7ri)). (73) 

TTi 

We take [183] as an example to explain the details of
employing conjecture to learn in SGs. In [183], the power
allocation problem in an underlay CR mesh network (Figure 16) is
studied. The multi-node power allocation process is modeled
as an SG, in which the local binary state of a secondary link
is determined by the SINR level of its receiver. The local
payoff is measured by the power efficiency. Compared with
the standard matrix-game-based strategy-learning mechanism
in (62)-(63), the authors in [183] construct the Q-table with
only local states and actions. Here, the policy conjecture is
introduced to approximately learn the matrix game equilibrium
strategy and the Q-value of the SG. Based on the conjecture-updating
scheme in (71), the Q-value updating mechanism is
defined as follows:

= {I - Q*)Q'^p fisi,ai)+ 

S c\{si,a-i)ui{si,ai,a-i)-^fi max 

(74) 

The local policy $\pi_i^t$ is updated using the Logit function (42).
It is proved in [183] that the second term on the right-hand
side of (74) is a contraction mapping operator and the learning
scheme converges with a sufficiently large number of iterations.

3) Other Learning Algorithms in SGs: For algorithms that
do not work in the framework of hierarchical learning that is
separated into learning in the matrix games and the original
SG, we simply assign them to the category of the “other learning
algorithms”. In these algorithms, the Q-learning-based
value-iteration scheme for the payoff of the matrix game may not
necessarily be applied, or the computation of the state value of
the SG may not be needed. Due to the complexity of a general
SG, most of the existing learning methods in this category
cannot be represented by a single prototypical algorithm.

Algorithm 2 Two-layer learning mechanism in the SG.
Require: Initialize $V_{\beta,i}^0$ and $\pi_i^0$, $\forall 1 \le i \le |\mathcal{N}|$.
while convergence criterion is not met do

Outer loop: $V_{\beta,i}^{t+1} \leftarrow \mathrm{UpdateStateValue}(u_i, V_{\beta,i}^t, \pi_i^t, \pi_{-i}^t)$

Inner loop: $\pi_i^{t+1} \leftarrow \mathrm{UpdateStrategy}(V_{\beta,i}^t, \pi_i^t, \pi_{-i}^t)$.

end while

We note that for an SG, the property of the MDP generally 
requires that the state value of the game be computed following 
the Bellman optimality equation (in the general form as (O), 
whenever a stationary policy is to be obtained. Extending from 
the value-iteration-based algorithm, we can construct a general 
learning scheme, which is composed of two learning loops: an 
inner loop that uses an appropriate scheme to approximate the 
SG equilibrium strategies $\pi^*$ and an outer loop that employs
an appropriate method to estimate the state value $V_i(s)$ of
each player. Within this general framework, the construction of
matrix games is not necessary. We can generalize the two-layer
learning process in SG $G = (\mathcal{N}, \mathcal{S}, \mathcal{A}, \{u_i\}_{i \in \mathcal{N}}, \Pr(s'|s, \mathbf{a}))$
as Algorithm 2.
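
The two-layer structure can be sketched as a generic loop with pluggable update rules. This is only a skeleton under stated assumptions: the callables `update_state_value` and `update_strategy`, their signatures, and the stopping rule based on the largest change in state values are all hypothetical choices, since the text leaves the concrete inner and outer schemes open.

```python
def two_layer_learning(values, policies, update_state_value, update_strategy,
                       tol=1e-8, max_iters=10_000):
    """Alternate state-value (outer) and strategy (inner) updates for all players.

    values[i][s] : player i's current estimate of the value of state s
    policies[i]  : player i's current strategy (any representation)
    The callables are hypothetical plug-ins with the signatures
    update_state_value(i, values_i, policies) and
    update_strategy(i, values_i, policies).
    """
    n_players = len(values)
    iteration = 0
    for iteration in range(1, max_iters + 1):
        # Outer loop: every player refreshes its state-value estimates.
        new_values = [update_state_value(i, values[i], policies)
                      for i in range(n_players)]
        # Inner loop: every player revises its strategy simultaneously,
        # using the refreshed values and the opponents' previous strategies.
        new_policies = [update_strategy(i, new_values[i], policies)
                        for i in range(n_players)]
        change = max(abs(new - old)
                     for new_v, old_v in zip(new_values, values)
                     for new, old in zip(new_v, old_v))
        values, policies = new_values, new_policies
        if change < tol:  # assumed convergence criterion
            break
    return values, policies, iteration
```

Any of the inner-loop schemes discussed in this subsection (for example, FP-based or Logit-based strategy updating) could be supplied as `update_strategy`; the skeleton itself makes no claim about convergence.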

One widely used two-layer approach for strategy learning
in wireless SGs is to adopt FP-based policy updating as
the inner-loop learning scheme. Such an approach of policy
evolution is rooted in the model-based learning
algorithms (namely, with known state-transition maps) [185].
Since the standard FP-based algorithm with (26) and (27)
requires each wireless node to track the opponents' actions,
extending FP-based learning from repeated games to the SG
is considered a challenge due to the explosion of state-action
dimensionality. In [186], such a challenge is resolved by
regulating the SG into a sequential game, in which only one
wireless node is allowed to update its action in each round.
In [186], the problem of joint channel selection and power
allocation for the SUs in an overlay DSA network is studied.
With the assumption of a sequential game, each SU adopts a
standard SAS-based Q-learning scheme as in (7) for updating
the Q-table based on the local state-action pairs. To further
reduce the state-action space, Q-learning is only applied to the
strategy learning for channel selection. The power adaptation
is performed only after the channels are selected by the SUs.
The FP-based mixed-strategy-updating scheme in [186] can be
considered as a variation of the best-response-based strategy
learning schemes described in (28).

It is also necessary to consider a different approach to 
update the state value for FP-based learning when the players 
in the SGs update their strategies simultaneously, because the 
state value of the MDP cannot be easily estimated by only 
tracking the opponents’ actions. For those works that directly 
estimate the state value without using the TD-learning-based 
methods, it is also necessary to track the frequency of state
transitions in order to estimate the state transition probabilities.
Examples of learning the state transition can be found in [87],
[88]. In [87], secondary wireless stations compete with each
other for network resources to transmit delay-sensitive data in a
stochastic CRN. In [88], a similar problem is specified in an
overlay CRN with SUs competing for the vacant primary channels
and determining transmitting parameters in a cross-layer
manner. In both works, with the resource allocation problem
in the CRN being modeled as SGs, it is required that the state 
transition frequencies of the opponents’ local states are tracked 
by each SU. In order to reduce the information exchange 
overhead about local state transitions, an SU abstracts the state 
space by classifying the opponent SUs’ state space purely 
based on its local observation. Instead of learning the real 
state-transition frequencies, the transitions of the abstracted 
state are recorded. The state value of the SG is updated based 
on the reduced states using the standard Bellman optimality 
equation ©. 
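
A minimal sketch of frequency-based transition estimation with an optional state abstraction is given below. The class name, the `abstract` callable, and the dictionary layout are illustrative assumptions; the sketch is not the estimator used in [87] or [88].

```python
from collections import defaultdict

class TransitionTracker:
    """Counts observed (state, action) -> next_state transitions."""

    def __init__(self, abstract=lambda state: state):
        # `abstract` maps a raw observation to a (possibly coarser) state label,
        # so that transitions are recorded over the abstracted state space.
        self.abstract = abstract
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, state, action, next_state):
        key = (self.abstract(state), action)
        self.counts[key][self.abstract(next_state)] += 1

    def probabilities(self, state, action):
        """Relative frequencies of next states; empty dict if never visited."""
        seen = self.counts.get((self.abstract(state), action))
        if not seen:
            return {}
        total = sum(seen.values())
        return {nxt: n / total for nxt, n in seen.items()}
```

For example, passing `abstract=lambda s: "high" if s >= 5 else "low"` collapses a numeric local observation into two classes, which shrinks the table of transition counts at the cost of a coarser model.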

The special structure of some SGs can also be exploited
to simplify the learning process for the FP-based learning
mechanism. One example of such exploitation can be found in
[187], which models the distributed dynamic routing in multi-hop
CRNs as an SG (Figure 17). Since the states of the routing
SG in [187] are defined as the state of channel availability in
the CRN, the state transitions of the SG depend only
on the PU activities. The SUs in the network attempt
to find the route that minimizes the packet-forwarding delay
due to queueing and channel collision while keeping their
interference to the PUs as small as possible. Since the delay
over a path is equal to the accumulated delay caused by each
link in the path, and the state transition is independent of the
SUs' actions, the original SG in [187] can be decomposed into
a group of layered, stochastic subgames. Each subgame
corresponds to a hierarchy level in the routing path (see Figure
17). The structure (i.e., the payoff matrix) of each subgame
can only be determined when the cost (measured in delay)
of the next-layer game is determined. A backward induction
method is adopted in [187] to compute the equilibrium payoff
in the layered routing game. The computation proceeds from the
subgame of the layer which ends at the sink SU to the subgame
of the layer which begins from the source SU. Since the state
transition is independent of the SUs' actions, the stochastic
subgame in each layer can be reduced to a group of repeated
games with fixed states. Therefore, the learning of state value
becomes unnecessary and FP-based learning guarantees the
convergence to the global NE, as long as the routing costs at
the equilibrium point of each subgame are properly propagated
to their lower layers.
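
The backward propagation of routing costs across layers can be illustrated with a deliberately simplified sketch. It assumes that the equilibrium of each layer's subgame is replaced by a plain minimum over next hops, so it is a dynamic-programming stand-in rather than the game-theoretic procedure of [187]; the data layout `layer_delays[l][u][v]` is hypothetical.

```python
def backward_induction(layer_delays):
    """Propagate delay costs from the sink layer back to the source layer.

    layer_delays[l][u][v] is the delay of the link from node u in layer l
    to node v in layer l + 1 (assumed layout). Returns the cost-to-go of
    every node in every layer and the chosen next hop of every node.
    """
    n_sinks = len(layer_delays[-1][0])
    cost = [0.0] * n_sinks           # cost-to-go at the final (sink) layer
    costs, hops = [cost], []
    for delays in reversed(layer_delays):
        layer_cost, layer_hop = [], []
        for row in delays:
            # Total delay through each next hop = link delay + downstream cost.
            totals = [d + c for d, c in zip(row, cost)]
            best = min(range(len(totals)), key=totals.__getitem__)
            layer_cost.append(totals[best])
            layer_hop.append(best)
        cost = layer_cost
        costs.insert(0, layer_cost)
        hops.insert(0, layer_hop)
    return costs, hops
```

With `layer_delays = [[[1.0, 4.0]], [[5.0], [1.0]]]` (one source, two relays, one sink), the source prefers the second relay because the downstream cost of the first relay outweighs its cheaper first link.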

According to [187], the hierarchy levels of the CRN are calculated along
the “media axis”, which is composed of a set of points. At these points, the
lowest detection probability density of the PU’s activities is (approximately)
achieved.

Fig. 17. A snapshot of a hierarchical multi-hop CRN under the PU
interference footprint (adapted from [187]).

In addition to learning algorithms that follow Algorithm 2, a
number of miscellaneous learning mechanisms are applied to
SG-based problems in wireless networks. In order to reduce
the requirement of information exchange or to achieve
convergence, most of these learning mechanisms exploit special
properties of the SG. As we have discussed in Section IV-A
for the Aloha-like spectrum access problem in CRNs [126], the
near-NE policies of the stochastic access game can be obtained
if all the SUs update their local policies with the Logit function
(42), and the Q-value at state s is updated following (22).
In this specific scenario, the two-layer learning mechanism
based on Q-value updating ensures the convergence to near-NE
strategies of the SG without the need of any information
exchange. In [188], [189], the structural property of a constrained
SG is explored. Specifically, consider a utility-minimizing SG
$G = (\mathcal{N}, \mathcal{S}, \times_{i \in \mathcal{N}} \mathcal{A}_i, \{c_i\}_{i \in \mathcal{N}}, \{d_i\}_{i \in \mathcal{N}}, \Pr(s'|s, \mathbf{a}))$ with $c_i$ as
the instantaneous local cost in the objective and $d_i$ as the
instantaneous local cost in the constraint. If the following
assumptions are satisfied by G:

A1) the set of policies that satisfy the constraint of the SG is
non-empty,

A2) the two cost functions $c_i$ and $d_i$ are multi-modular functions
with respect to the actions and the state elements
whose transition is a function of the joint local actions,
A3) the transition probability $\Pr(s'|s, \mathbf{a})$ is submodular with
respect to the actions and the state elements whose
transition is a function of the joint local actions,
then G has the following property in the structure of the NE:

Theorem 8. Assume A1-A3 hold. Then the NE policy of each
player i, $\pi_i^*$, is a randomized mixture of two pure policies,
$\pi_i^1$ and $\pi_i^2$. Each pure policy is nondecreasing in the state
elements whose transition is determined by the joint actions.

Based on Theorem 8, the search for NE policies $\pi_i^*$ can be
reduced to finding a randomized mixture of discrete actions in
the finite action set. A policy-iteration-based strategy-learning
algorithm can be developed based on the Simultaneous Perturbation
Stochastic Approximation (SPSA) algorithm [190]. In
[188], the rate adaptation problem in a TDMA-based CRN is
modeled as an SG with a latency constraint. In [189], the problem
of joint source-channel rate adaptation for transmitting
layered video in a multi-user wireless local-area network is
also formulated as an SG with a latency constraint. In both
works, by showing that the assumptions A1-A3 hold in their
respective SG-based models, the SPSA algorithm is applied for
policy learning. With the SPSA algorithm, no explicit state
value learning is needed, and the local policies are updated
with a gradient-based method with random policy perturbation.
Given that the assumptions A1-A3 hold in the SG, the SPSA
algorithm is proved to converge in distribution to the Kuhn-Tucker
(KT) pair of the original constrained MDP (Theorem
3 in [188]).
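
To make the idea of gradient-based updating with random perturbation concrete, here is a minimal sketch of the generic, unconstrained SPSA iteration. It assumes the commonly used gain sequences with exponents 0.602 and 0.101 and a symmetric Bernoulli perturbation; it does not reproduce the constrained, policy-space variant of [188], and the function name and parameters are hypothetical.

```python
import random

def spsa_minimize(loss, theta, iterations=200, a=0.2, c=0.1, seed=0):
    """Minimize `loss` with simultaneous perturbation stochastic approximation.

    Only two loss evaluations are needed per iteration, regardless of the
    dimension of `theta`.
    """
    rng = random.Random(seed)
    theta = list(theta)
    for k in range(iterations):
        a_k = a / (k + 1) ** 0.602   # step-size sequence (assumed exponents)
        c_k = c / (k + 1) ** 0.101   # perturbation-size sequence
        # Perturb all coordinates at once with independent +/-1 entries.
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + c_k * d for t, d in zip(theta, delta)]
        minus = [t - c_k * d for t, d in zip(theta, delta)]
        diff = (loss(plus) - loss(minus)) / (2.0 * c_k)
        # Gradient estimate for coordinate i is diff / delta_i.
        theta = [t - a_k * diff / d for t, d in zip(theta, delta)]
    return theta
```

In a learning setting, `loss` would be replaced by a noisy, measured cost of running the perturbed policy, which is what makes the method usable without explicit state-value learning.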

In ESI, another distributed learning algorithm is constructed 
based on the framework of $L_{R-I}$ learning in the team SGs.
A team SG can be considered as a variation of potential
games in which all the players in an SG share the same payoff
function (i.e., a fully cooperative SG). With the proposed
learning scheme, an LA is maintained for every state of the
underlying Markov chain by each player in the SG. At any
time instance, only one LA is activated by each player to learn
its optimal action probabilities in the corresponding state. The
introduction of LA reformulates the stochastic game between
the $|\mathcal{N}|$ players into a repeated game between the $|\mathcal{N}| \times |\mathcal{S}|$
automata. Extending from the special case of team SGs, the
convergence condition of the LA-based learning scheme for
SGs is generalized by the following theorem:

Theorem 9 (El). For SG $G = (\mathcal{N}, \mathcal{S}, \mathcal{A}, \{u_i\}, \Pr(s'|s, \mathbf{a}))$,
assume that the multi-agent Markov chain corresponding to
each joint policy, $\pi(s)$, is ergodic. If $\pi^*(s)$ is a pure NE
policy in the view of the $|\mathcal{N}|$ players in G, $\pi^*(s)$ is also a pure
equilibrium for the reformulated game between the $|\mathcal{N}| \times |\mathcal{S}|$
LA, and vice versa.
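
The idea of keeping one automaton per state can be sketched with the standard linear reward-inaction rule. The sketch below is an assumption-laden simplification: rewards are taken to be normalized to [0, 1] and are applied immediately, whereas the scheme discussed above may feed each automaton a reward accumulated between visits to its state; the class and parameter names are hypothetical.

```python
import random

def lri_update(probs, action, reward, step=0.1):
    """Linear reward-inaction update; `reward` must lie in [0, 1]."""
    return [p + step * reward * (1.0 - p) if j == action
            else p - step * reward * p
            for j, p in enumerate(probs)]

class PerStateAutomata:
    """One learning automaton per state, all owned by a single player."""

    def __init__(self, n_states, n_actions, step=0.1, seed=0):
        self.step = step
        self.rng = random.Random(seed)
        self.probs = [[1.0 / n_actions] * n_actions for _ in range(n_states)]

    def act(self, state):
        # Only the automaton of the current state is activated.
        actions = range(len(self.probs[state]))
        return self.rng.choices(actions, weights=self.probs[state])[0]

    def learn(self, state, action, reward):
        self.probs[state] = lri_update(self.probs[state], action, reward, self.step)
```

With a zero reward the action probabilities stay unchanged, which is the "inaction" part of the rule.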

According to Theorems 4 and 9, whenever an NE point
in pure strategies exists in an SG (which is always the case
for team SGs), the LA-based learning algorithm proposed
in El is guaranteed to find the NE. However, it is worth
noting that only maintaining an independent, repeated-game-based
learning process (e.g., LA or SFP) for each state by the
players may not necessarily produce the NE strategies for a
general-case SG. Take the SFP learning scheme for example.
In a general-case SG, the action-dependent state transition
renders the Logit function in (42) no longer the solution to
the perturbed best response. As a result, a Lyapunov function
cannot be found in the same way as for repeated games and
the convergence property of the corresponding best-response
dynamic in Robbins-Monro form is undermined. Therefore,
special structure is required for the SGs if the repeated-game-based
learning processes are to be adopted. In [191], a
sufficient condition is given for the adoption of the CODIPAS-RL
learning schemes (more specifically, LA and SFP-based
learning) in the general-case two-player nonzero-sum SGs:

C1) the state transitions are independent of the player actions.

It is easy to prove that, given condition C1, by fixing the state
variable and solving for all the state-dependent NE with the
repeated-game-based learning algorithms discussed in Section
IV-A, we are able to obtain the state-independent NE of the
two-player nonzero-sum SGs. The conclusion can be further
extended to $N$-player games. When the state transitions are
also independent of the current state, each player only needs to
maintain a single learning process (see the examples in [191],
[192]). However, due to the constraint on the state transition
conditions, only a few applications of the SFP/GP/LA-based
algorithms for the SGs-based network control problems can
be found in the literature [192], [193].

VI. Challenges and Open Issues in Model-Free
Learning for Cognitive Radio Networks

In this section, we expand our discussion to the challenges
and open issues that are yet to be addressed in the area of
learning for distributed control and/or wireless networking.
In Section VI-A, different aspects of the learning mechanism
goals are reviewed, and the potential conflict between these
aspects is discussed. In Section VI-B, we propose a problem
to cope with the outlier agents who do not (necessarily)
follow a given learning rule in a learner set. In Section VI-C,
the possibility of transferring experience from one learning
scenario/process to a different learning scenario/process is
discussed. In Section VI-D, we discuss a problem on the
coordination among simultaneous learning modules over different
protocol layers for the same network entity.

A. The Goal of Learning: Self-Play, Stability and Optimality 

Generally, the goal of a perfect self-organized learning 
mechanism for multi-agent decision making processes is to 
achieve self-play (autonomy), stability and optimality at the 
same time. However, it has been well-recognized that for 
multi-agent learning (more frequently in a stochastic scenario), 
improving system performance typically incurs more signaling 
and coordination, thus undermining the self-play structure. 
Especially, when learning is implemented under the framework 
of games, achieving any two goals of self-play, stability and 
network optimality is usually at the cost of undermining the 
third goal. In recent years, the relationship among the three
goals in multi-agent learning has been discussed
in many works, but mostly from a high-level theoretical
perspective Ea, Iml, lfT94l . 

In regard to the applications of learning in wireless networks,
the situations that have been found to keep consistency
between a distributed solution and an optimal solution
are limited to a small scope. One important case among these
situations is the class of network control problems that are modeled
as a potential game Gil. For potential games, the following 
properties d make it possible to achieve convergence to 
the optimal operation point through adopting the learning
algorithms that we discussed in Section IV-A:

• Every potential game has at least one pure strategy NE. 

• Any global or local maximum of the potential function
defined in the game constitutes a pure strategy NE.

Based on the above properties, it is only necessary to prove 
the uniqueness of the NE in a repeated game for learning 
processes to achieve optimal operation point with sequential 
best-response play ll^ or no-regret learning. Apart from the 
works discussed in Section IV-A, the applications of distributed
learning in potential games in order to achieve global
optimization can usually be found in a set of congestion-game-like
problems such as [195], [196]. However, the potential
game requires that local users are able to (implicitly) perceive
the utilities of the entire network in order to establish the
correspondence between the local utility function and the
constructed potential function. Since this requirement
comes at the cost of trading off the conditions for self-play, it
significantly limits the applications of the potential-game-based
learning algorithms.
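
The role of the potential function can be illustrated with a toy channel-selection congestion game, in which each user's cost equals the number of users sharing its channel. The sketch below is an assumed example rather than a model from [195] or [196]; it uses cost minimization, so the Rosenthal potential decreases with every improving move (equivalently, its negative increases), and sequential best-response play stops at a pure-strategy NE.

```python
def best_response_dynamics(choices, n_channels, max_rounds=100):
    """Sequential best-response play in a toy channel congestion game."""
    choices = list(choices)
    for _ in range(max_rounds):
        changed = False
        for i in range(len(choices)):
            # Load that player i would see on each channel from the others.
            loads = [0] * n_channels
            for j, ch in enumerate(choices):
                if j != i:
                    loads[ch] += 1
            best = min(range(n_channels), key=loads.__getitem__)
            if loads[best] < loads[choices[i]]:   # strictly improving move only
                choices[i] = best
                changed = True
        if not changed:                            # no player wants to deviate
            return choices
    return choices

def rosenthal_potential(choices, n_channels):
    """Sum over channels of 1 + 2 + ... + load; improving moves lower it."""
    loads = [choices.count(ch) for ch in range(n_channels)]
    return sum(load * (load + 1) // 2 for load in loads)
```

Starting from four users on the same channel out of two, the dynamics end with two users per channel, and the potential drops from 10 to 6.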

For other model-free distributed learning mechanisms in 
a multi-device wireless network, how to coordinate the goal 
of optimality and self-organization when adopting a learning 
scheme generally remains an open question. As a result, most 
current studies focus on ensuring convergence to the stable 
operation point in self-play by allowing a limited level of 
control signal exchange. Although there are a few already- 
known conditions that ensure the convergence of a learning 
algorithm, most of which are applicable to repeated games 
(e.g.. Theorem |2] and O, for most current studies, whether 
a stability condition can be found for a learning scheme 
also remains an open issue. In the literature, the approaches 
to find the convergence condition of the learning algorithms
generally fall into two major categories. For learning processes 
that can be approximated with a linear system described 
as a set of ODEs in continuous time, the typical way of 
obtaining the convergence condition is to construct a Lyapunov 
function for the ODE-based dynamic and then prove that the 
strategy/utility updating mechanisms produce an asymptotic 
pseudo-trajectory of the flow defined by the ODE through the 
stochastic-approximation-based analysis (see the example in 
Jill, 111521 1. The analysis of learning using the ODE-based 
approach can be found in [126], [142], [145], [159], [197].
For the situations which cannot be easily modeled as a linear,
ODE-based system, the contraction-map-based analysis (see
the example in [66]) can be considered as an alternative.
Usually, the contraction map is considered appropriate for the
analysis of SG-based learning when the problem model
is of high complexity [183], [184]. Table XII summarizes
convergence conditions for the multi-agent learning algorithms
discussed in Sections III-V.
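
For reference, the contraction property invoked in this type of analysis is the standard one. The operator symbol $T$, the modulus $\gamma$, and the choice of the max-norm below are notational assumptions and are not taken from [66] or [183]:

```latex
\| T Q - T Q' \|_\infty \le \gamma \, \| Q - Q' \|_\infty,
\qquad 0 \le \gamma < 1, \quad \forall\, Q, Q'.
```

By the Banach fixed-point theorem, such an operator has a unique fixed point $Q^*$, and the iteration $Q^{k+1} = T Q^k$ satisfies $\|Q^k - Q^*\|_\infty \le \gamma^k \|Q^0 - Q^*\|_\infty$, which is why establishing the contraction property is sufficient for convergence.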

TABLE XII
A SUMMARY OF THEORETICAL CONVERGENCE CONDITIONS FOR THE MAS-BASED LEARNING ALGORITHMS

| Problem Formulation Category | Learning Scheme | Convergence Condition | Stable Operation Point | Required Signaling |
|---|---|---|---|---|
| Loosely Coupled MAS | Distributed (independent) Q-learning | Generally not known | Sub-optimal | None |
| Repeated Games | Standard FP | Not guaranteed except in (a) two-player games and (b) multi-player games with common payoff [71] | $\epsilon$-NE | Exchange of local-action information |
| Repeated Games | Stochastic FP | Not guaranteed except in (a) potential games, (b) supermodular games, (c) two-player zero-sum games and (d) two-player symmetric games [76] | $\epsilon$-NE | None |
| Repeated Games | Gradient play | Conditional convergence for strict NEs in multi-player games [71] | NE | Exchange of local-action information |
| Repeated Games | $L_{R-I}$ | Conditional convergence for strict NEs in multi-player games (see Theorem 4) | $\epsilon$-NE | None |
| Repeated Games | No-external-regret learning (Hedge) | Potential games | NE | None |
| Repeated Games | No-internal-regret learning | CE in multi-player games | Non-social-optimal CE (72) | None |
| Stochastic Games | Minimax Q-learning | Not known | NE | Knowing the structure of local payoff function |
| Stochastic Games | Nash Q-learning/R-learning | Each matrix game has a unique NE [66], (sa | NE | Exchange of local action/payoff information |
| Stochastic Games | Conjecture learning | Conditional convergence | Conjecture equilibrium | Knowing the reference point |
| Stochastic Games | FP-based policy updating | Generally not known | NE | Exchange of local-action information |

In addition to the issues associated with finding the
convergence condition for a learning scheme, another concern
when applying model-free learning in wireless networks is the
convergence rate of learning algorithms. Although analytical
results for the convergence rate of learning algorithms are
highly desired, most of the existing studies are only able
to show empirical results for the learning convergence rate
through numerical simulations (see the examples in [88],
[183]). This is partly due to the asymptotic
convergence condition (if there is any), which requires for
most of the existing learning algorithms that the states and actions
be visited infinitely often to ensure convergence. Given such a
limitation, one known approach to analyzing the convergence
speed of a learning scheme is to view the learning process
itself as a discrete-time Markov chain. In this approach, the
standard Markov chain analysis can be applied to obtain the
expected time (number of iterations) to learn before reaching
the chain’s absorbing state (e.g., the equilibrium point of
a repeated game). Such a technique can be found in the
recent studies [198], [199]. In [198], the Markov-chain-based
analysis is used to measure the lower bound of the iterations
needed for the Logit-function-based learning scheme to leave
a sub-optimal NE in a potential game for gateway selection.
In [199], the same method is employed to track the
average iterations that a trial-and-error-based learning method
needs for reaching the NE of a joint channel-power selection
game for the first time. However, such an approach could be
computationally intractable when the system/learning scheme
is too complicated, and it is yet to be found applicable to the
more complex learning algorithms such as those in the SGs.
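
The Markov-chain view of convergence speed can be made concrete with the standard absorbing-chain computation: if $Q$ is the transition matrix restricted to the transient states, the expected numbers of steps before absorption solve $(I - Q)\,t = \mathbf{1}$. The sketch below is a generic illustration with a hypothetical three-state chain, not the analysis of [198] or [199].

```python
def expected_absorption_times(P, absorbing):
    """Expected number of steps to absorption from every transient state.

    P is a row-stochastic transition matrix (list of lists) and `absorbing`
    is the set of absorbing state indices. Solves (I - Q) t = 1 by
    Gauss-Jordan elimination with partial pivoting.
    """
    transient = [s for s in range(len(P)) if s not in absorbing]
    n = len(transient)
    # Augmented matrix [I - Q | 1] over the transient states.
    A = [[(1.0 if i == j else 0.0) - P[si][sj]
          for j, sj in enumerate(transient)] + [1.0]
         for i, si in enumerate(transient)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [x / pivot_value for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0.0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return {s: A[i][n] for i, s in enumerate(transient)}
```

The cubic cost of the elimination in the number of transient states hints at why this approach becomes intractable once the joint state-action space of the learning process grows large.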

B. Heterogeneous Learning and Strategic Teaching in the 
Context of Games 

For the existing studies of strategy learning in wireless networks,
one of the most important assumptions is that each individual
agent abides by the same learning rule (or just uses variable
parameters for the same learning scheme). Only under such an
assumption can the convergence properties of the learning scheme
be mathematically tracked. However, in many practical
scenarios, especially when malicious nodes
exist in the network, such an assumption may not be applicable
and the malicious nodes may intentionally deviate from the
given learning rule. One possible scenario of such a case can
be found in a selective-forwarding-based attack-defense game,
in which a sophisticated attacker with the ability of selectively
forwarding the received packets may wait and abide by the
normal packet forwarding rule until some critical packets are
sent to it, and only then start dropping. To the best of our knowledge,
currently there are few (if any) works discussing this
situation.

To further demonstrate the situation in which a learner
may benefit by deviating from a common learning rule, we
introduce the concept of “strategic teaching”, which is first
discussed in the studies of economic games [200]. With
strategic teaching, it is assumed that the game is composed
of a number of adaptive players and sophisticated players.
An adaptive player learns its strategy following the learning
scheme that it is assigned to. By contrast, the sophisticated
players are able to adopt a non-myopically optimal strategy
and afford a certain short-term loss. Since the adaptive learners
will finally learn the best response to a pre-committed strategy
by the sophisticated player under the given learning scheme,
the sophisticated players will be able to induce the adaptive
players to expect some specific patterns of strategies from
them in the future [200]. Then, the sophisticated players
will be able to take advantage of the behavior patterns that
they “teach” the adaptive players. It has been found that a
sufficiently patient strategic teacher can achieve as much utility
as from first-play in a Stackelberg game [200]. Thus,
sophisticated play may become a favorable way of strategy
adoption for a noncooperative or a malicious node in the
wireless network compared with strictly following
the same learning rule.

About the difference between a Stackelberg equilibrium and an NE, the readers
are referred to m for more details.

In [200], a heuristic, model-free learning method known
as Experience-Weighted Attraction Learning (EWAL) [139]
is applied to a repeated trust game (i.e., lender-borrower
game) as the basis of both adaptive learning and sophisticated
learning. In that game, M borrowers try to borrow money
from each of a series of N lenders. A lender only makes
a one-time binary decision on either Loan or No Loan in a
single round out of an N-round game. A borrower makes a
series of N binary decisions on Repay or Default regarding
each lender that it borrows money from after observing the
lender’s decision. The sequences of the N-round stage-games
(also known as supergames) are repeated many times
with a random order of lenders to make decisions in each
sequence. In one sequence, one borrower is picked as the
common borrower in the game. All the lenders and some of the
borrowers play as adaptive players and learn their strategies
with EWAL. The rest of the borrowers are assumed to be dishonest
and adopt sophisticated play. It is assumed that the actions
and instantaneous payoffs of one player are observable by
the other players. For the adaptive players, EWAL uses the
Logit-function-based rule as in (42) for strategy updating.
Instead of directly using the instantaneous/accumulated payoff
as the argument of operator exp(·) in the Logit function,
EWAL introduces the concept of experience accumulation
through reinforcement and employs two new measurements
to build local experience: the observation-equivalents of the
past experience and the attraction to a specific strategy [139].
The former is similar to the action-frequency estimation in FP
and the latter is used as the argument of the Logit function.
In the game, the adaptive players apply EWAL twice to build
their attraction, first within a lending-borrowing sequence (i.e.,
supergame) and then across the subsequent sequences. For
the sophisticated borrowers, the learning process does not
differentiate between attraction building within a supergame and
across different supergames. A sophisticated borrower guesses
how the lender learns according to the attraction value of the
adaptive lender that it observes. Then, the policies of default
and repay are sought by incorporating estimated policies of the
lenders into the computation of its own sophisticated attraction
function (see Section 4.1 of [200] for the details). It has
been demonstrated in [200] that by adopting sophisticated play
with the attraction updating mechanism based on lender policy
estimation, the dishonest borrowers are able to outperform the
adaptive borrowers which follow the same EWAL learning rule
as the lenders. For simplicity, the mechanism of sophisticated
play can be interpreted as playing additional tricks on the
adaptive lenders by repaying frequently enough so that, if the
dishonest borrowers do default, the lenders' belief in the
trustworthiness of these borrowers does not fall below a
critical level. Such an example provides an
important insight into the possible strength of sophisticated
play in repeated noncooperative games. However, few studies
discuss such an issue in the context of wireless networks.
Also, it is generally not clear how strategic teaching with
sophisticated play in other forms can be enforced or avoided
in the current framework of learning and in what ways it will
affect the equilibria that can be reached.
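
The two measurements used by an adaptive EWAL player can be sketched as follows. The sketch follows the commonly cited form of the experience-weighted attraction update, with decay parameters `phi` and `rho` and an imagination weight `delta` for unchosen strategies; these parameter names, their default values, and the function names are assumptions for illustration, and the sophisticated-player extension of [200] is not covered.

```python
import math

def ewa_update(attractions, experience, chosen, payoffs,
               phi=0.9, rho=0.9, delta=0.5):
    """One experience-weighted attraction update for a single player.

    attractions[j] : current attraction of strategy j
    experience     : observation-equivalents of past experience
    chosen         : index of the strategy actually played
    payoffs[j]     : payoff strategy j earned (or would have earned) against
                     the opponents' observed actions
    """
    new_experience = rho * experience + 1.0
    new_attractions = []
    for j, (previous, payoff) in enumerate(zip(attractions, payoffs)):
        # The chosen strategy gets full weight, forgone ones get weight delta.
        weight = 1.0 if j == chosen else delta
        new_attractions.append(
            (phi * experience * previous + weight * payoff) / new_experience)
    return new_attractions, new_experience

def logit_choice(attractions, sensitivity=1.0):
    """Logit choice probabilities with the attractions as arguments."""
    top = max(attractions)
    weights = [math.exp(sensitivity * (a - top)) for a in attractions]
    total = sum(weights)
    return [w / total for w in weights]
```

Setting `delta` to zero makes the player reinforce only the strategy it actually played, while `delta` equal to one weighs forgone payoffs as heavily as realized ones, which is closer in spirit to belief-based learning.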

C. Experience Transferring between Heterogeneous Learners 

As we note from Sections III-V, one of the significant
benefits of model-free learning is to allow the decision-making
entities to learn the strategies from scratch without
a-priori knowledge of the wireless network. However, since
model-free learning is based on trial-and-error, when the
network environment has dramatically changed, the learners
generally need to start the same learning process from the
very beginning. One example of such scenarios can be found in
the interference mitigation problem for cellular networks, in which
mobile stations may enter or leave the network frequently.
For most of the existing model-free learning algorithms, such
changes in the network topology mean changes in the MDP
model of the network with new dimensions of states/actions, if
MDP-based learning is adopted, or the transition from an old
network-control game to a new one since the set of players
is different, if game-based learning is adopted. As a result,
when it is required that the decision-making agents swiftly
switch from an old scenario to a new one, the existing learning
methods will face great challenges if they can only restart the
learning process in the new scenario.

In order to address such a challenge, a natural consideration
is to utilize the acquired experience of strategy taking which is
obtained from the old scenario. We note that such a process is
fundamentally different from the experience sharing process
discussed in Section IV-B, since for the experience-sharing
framework such as docitive networks, parallel and homogeneous
learning processes are assumed so that the expert agent
is able to share its better experience of the same stochastic
process with the newcomers. In the scenarios of dramatic
environmental changes, the experience transferring paradigm,
Transfer Learning (TL) [138], is considered more appropriate
for the tasks of sharing experiences of strategy taking between
heterogeneous learning processes. Compared with the
experience transferring between homogeneous learners, the
motivation of TL is to transfer knowledge (i.e., experience)
from the well-established learning processes (known as the
source tasks) to the newly established learning processes
(known as the target tasks) in a different situation. It is worth
noting that under the framework of MDP-based learning, TL
allows differences in state spaces, state variables/transitions,
reward functions and/or sets of actions CMl. 

TL has been considered difficult to implement for learning
in wireless networks. This is mainly due to the fact that
it is difficult to find a proper mapping (either in value-function
representation or directly in policy transferring [138])
to transfer between learning tasks with different action-state
representations. For the applications in wireless networks, one
example of policy-transferring TL can be found in [201]. In
[201], a highly dynamic opportunistic network which is based
on LTE-A is studied. The network topology is assumed to
change with time, and the eNodeBs (eNBs) are supposed to
be responsible for learning channel allocation under the
conditions of mutual interference among the user equipments. The
mechanism of policy transferring is adopted on the basis of
two model-free learning algorithms: the linear reinforcement
learning and the single-state Q-learning. The former employs
a simple, linear updating function for state-value updating,
while the latter applies Q-learning to update a state-less Q-table.
For TL, one shot of the changing network topology is
considered as a learning phase, and the objective of TL is
to apply the experience learned in previous phases (sources)
to the similar phases (targets) in the future. The eNBs which
attempt to assign channels to the user devices for interference
coordination work as the learning agents and obtain the
spectrum priority by sorting the Q-table obtained in the
current phase in descending order. A policy function is designed
to transfer the Q-table learned in a previous phase to the
new phase by assigning weights that map the source priority
table to the target priority table in the new phase. Such a
procedure of associating the channel priority in the target
task with the channel priority in the source task can be
considered as initializing the learning process in the new
phase with the knowledge transferred from the old phase.
Thereby, the information from transfer learning and distributed
learning is combined through weighting the values of channel
priorities. The Q-table in the new phase is learned with the
given reinforcement learning methods. The policy transferring
process in [201] is demonstrated in Figure 18.

Fig. 18. Architecture of the policy-transfer mechanism in the LTE-A based
opportunistic network [201].
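
The idea of initializing a new learning phase from a weighted source priority table can be sketched as below. The convex-combination weighting, the `weight` parameter, and all function names are hypothetical simplifications chosen for illustration; the actual policy function of [201] is not reproduced here.

```python
def channel_priority(q_table):
    """Channels sorted by descending Q-value (ties broken by channel index)."""
    return sorted(range(len(q_table)), key=lambda ch: (-q_table[ch], ch))

def transfer_initial_q(source_q, target_q, weight=0.7):
    """Initialize the target-phase Q-table from a weighted source Q-table.

    `weight` (an assumed parameter) controls how strongly the source phase
    influences the starting point of the new phase.
    """
    return [weight * s + (1.0 - weight) * t for s, t in zip(source_q, target_q)]

def stateless_q_update(q_table, channel, reward, alpha=0.1):
    """Single-state Q-learning update for the channel that was just used."""
    updated = list(q_table)
    updated[channel] = (1.0 - alpha) * updated[channel] + alpha * reward
    return updated
```

After the transfer step, learning in the new phase proceeds with `stateless_q_update`, so a poorly matched source table is gradually overwritten by fresh observations.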

A different approach of applying TL to wireless networking
problems can be found in [202], where the authors apply TL to
a series of actor-critic learning processes to coordinate BS
switching/sleeping in a cellular network. In [202], the
possibility that knowledge transferred from the old task
provides improper guidelines for the new task is considered.
The actor-critic learning scheme is performed by a
BS-operation controller, and is based on a multi-state MDP
model for the traffic load of the serving BSs. Compared with
[201], the difference of the TL mechanism in [202] lies in the
way of adopting the transferred policies. Instead of using the
static transferred knowledge for the initialization of the new
learning phase, the experience in the new learning phase is
divided into two sources: the “native policies” obtained
through actor-critic learning and the “exotic policies”
obtained as transferred policies from old tasks. The weight of
the exotic policies contributing to the overall strategy
selection decreases as the native learning process progresses.
The learning-knowledge-transferring process is demonstrated in
Figure 19. It is mathematically proved that regardless of the
initial value of the overall policies and the transferred
policies, the actor-critic-learning-based algorithm is
guaranteed to converge. Also, numerical simulations show that
TL does improve the learning speed when compared with the
reinforcement learning methods without TL.
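
The idea of blending native and exotic policies with a decaying weight can be sketched as follows. This is only an illustration under stated assumptions: the geometric decay schedule `theta ** step`, the tabular preference representation and the function names are hypothetical and are not taken from [202].

```python
import math

def transfer_rate(step, theta=0.9):
    """Assumed decay schedule: weight of the exotic policy after `step` updates."""
    return theta ** step

def overall_preferences(native, exotic, step):
    """Blend native and exotic action preferences; exotic weight shrinks over time."""
    w = transfer_rate(step)
    return [(1.0 - w) * n + w * e for n, e in zip(native, exotic)]

def softmax(prefs, temperature=1.0):
    """Turn action preferences into a selection probability distribution."""
    m = max(prefs)  # subtract the maximum for numerical stability
    exps = [math.exp((p - m) / temperature) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(native, value, state, action, reward, next_state,
                      alpha=0.1, beta=0.1, gamma=0.9):
    """One tabular actor-critic update of the critic and the native preferences."""
    td_error = reward + gamma * value[next_state] - value[state]
    value[state] += alpha * td_error            # critic: state-value update
    native[state][action] += beta * td_error    # actor: preference update
    return td_error
```

At step 0 the action distribution `softmax(overall_preferences(native[s], exotic[s], 0))` follows the transferred policy entirely; as the step count grows, the native preferences learned by `actor_critic_step` dominate, which limits the damage from an improper transferred policy.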

In the literature, most of the applications of TL in wireless
networks are set in scenarios which can be modeled as
MDP-based MAS. Despite the existing efforts to establish
a general framework of applying TL to learning in wireless
networks, the following questions remain to be answered:

1) Whether and how can TL be applied between related
games (e.g., symmetric games with the same structures of
payoffs and actions, but with different sets of players) for
accelerating the convergence to the equilibrium?

2) How can we measure the efficiency of knowledge
transferring?

3) Apart from policy transferring and value-function
transferring, can TL also be applied to heterogeneous learning
processes using different learning schemes?

Fig. 19. Architecture of the transfer-actor-critic algorithm [202].

In the literature, few studies in wireless networks are found 
discussing the aforementioned topics. However, discussions on 
cross-game learning or cross-mechanism learning have already 
begun in the area of economic games [203] and automatic
control [204]. Although the detailed discussion on these topics
is beyond the scope of this survey, it is believed that addressing 
these issues will bring great improvement to the existing 
learning mechanisms in the CRNs. 

D. Coordination of Learning Modules: Integration vs. Decomposition

In addition to the problems of heterogeneous learning
processes and of sharing experience or transferring knowledge
among different network devices, the coordination of
simultaneously running learning modules may still be a
challenging issue even within a single network device. As shown in our
previous discussion, learning processes targeting at various 
functionalities (which may or may not involve the interactions 
with other users) can happen in any layer of the protocol 
stack (see Figures 13 M and [13 for example). Although 
many existing works have succeeded in applying learning-based
solutions to their dedicated functionalities, a systematic
discussion on coordinating these learning processes for
different functionalities generally remains untouched in
current research. In the seminal work [205], it is pointed out
that different functionalities across the protocol layers may
exhibit a range of conflicts and/or dependence when working
concurrently in the same network. Thereby, it becomes
natural to approach learning-module coordination
by first identifying the conflicts or dependence
in practical scenarios.

Based on the work in [205], [206], we consider the following
major conflicts and/or dependence among different
network functionalities:

1) Logical dependence: this kind of dependence may arise
when there is a logical dependence between the objectives
of different network functions.

2) Parameter conflict/dependence: this kind of conflict or
dependence is triggered either when different networking
functions try to modify the same configuration parameters
or when the parameters of one function depend on some
other network parameters.

3) Measurement conflict: measurement conflicts exist if a
learning module depends on the state of the other learning
modules.

Logical dependence happens when different learning modules
exhibit a hierarchical dependence on the output of each
other. In this sense, the relationship between different learning
modules in a CR device shares a lot of similarity with the
relationship between the subtasks of a hierarchical reinforcement
learning mechanism [207], [208]. The major difference is that
in (MDP-based) hierarchical reinforcement learning, a single
learning process is decomposed into a number of subtasks
with their own sub-states, actions, transition functions and
rewards in a top-down manner with the help of recursive value
function decomposition¹ [208]. Since hierarchical learning
requires each child learning task to finish before its
parent task starts, it extends the MDP-based system model into a
semi-MDP-based system model, in which the amount of time
for the transition from one action to the next is a random
variable due to the existence of the subtask sequences. By
adopting the general idea of hierarchical learning, learning
coordination with logical dependence can be considered as
a reverse process of hierarchical learning: the existing
learning modules are integrated according to their dependence
to form a macro learning task. Practically, such an operation
of module concatenation may be extended to non-MDP-based
learning mechanisms. For example, in [180], [181], a
hybrid structure of both MDP-based Q-learning and
repeated-game-based no-regret learning is formed to approximate the
equilibrium strategy of an SG. In those two cases, the expected
utility based on the learned equilibrium of the repeated game
can be considered as the instantaneous utility of the
parent-level Q-learning process. However, the major difficulty in
applying a hierarchical learning-based coordination mechanism
lies in the uncertainty of convergence, as we have
highlighted in Section IV-B. Unlike the well-established examples of
hierarchical learning in the domain of robot control [207],
the applications in CRNs usually have no terminal state
for a subtask to determine when to stop its execution. As a
result, when to start and terminate a task in the framework of
hierarchical learning is usually determined empirically, and
the convergence conditions of such a learning process still
remain an open issue.
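
To make the semi-MDP view concrete, the following sketch shows the standard SMDP-style Q-learning update for a parent task whose "actions" are child subtasks of random duration. It is a generic textbook form, not the algorithm of [207], [208] or of [180], [181]; the dictionary-based Q-table and the function name `smdp_q_update` are assumptions for illustration.

```python
def smdp_q_update(q, state, subtask, rewards, next_state,
                  alpha=0.1, gamma=0.9):
    """Update the parent-level Q-value after a child subtask terminates.

    q        : dict mapping state -> dict mapping subtask -> value
    rewards  : per-step rewards collected while the child subtask ran;
               its length is the (random) duration of the transition
    """
    duration = len(rewards)
    # Discounted reward accumulated during the subtask's execution.
    accumulated = sum((gamma ** k) * r for k, r in enumerate(rewards))
    # The bootstrap term is discounted by the full duration of the subtask.
    target = accumulated + (gamma ** duration) * max(q[next_state].values())
    q[state][subtask] += alpha * (target - q[state][subtask])
    return q[state][subtask]
```

The update can only be applied once the child subtask has terminated, which is exactly where the difficulty noted above arises: without a terminal state, the length of `rewards` has to be fixed empirically.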

Unlike logical dependence, parameter conflict/dependence
and measurement conflict are caused by the conflicts of the
actions and states in different learning modules, respectively.
For example, parameter conflict may happen between the
inter-cell interference control and the coverage/capacity
optimization modules of a cellular network. With respect to downlink
transmit power control, the interference control module may
want to decrease the transmit power in order to reduce the
inter-cell interference, while the coverage/capacity
optimization modules may want to increase the transmit power to
improve the local link quality at the same time. For those
two kinds of conflicts, one traditional solution is to build a
decision tree to activate different decision modules according
to pre-determined conditions, which is also called
trigger-condition-action points [205]. However, the
trigger-condition-action based solution is a typical model-based method, and
thus cannot be directly incorporated into the coordination
process of learning modules.

¹ A general principle for a hierarchical value function decomposition is that
the reward function of a parent task is the state-value function of the child
task [208].

Although no prototypical solution has been proposed to 
resolve conflicts 2) and 3), it is still possible to address 
these conflicts by imitating the existing model-based methods 
when certain properties can be found in the learning
modules. Consider a general case where a number of learning 
modules share a subset of network states, and try to learn the 
strategy on the same action parameters to achieve different 
goals. To resolve the conflicts, we can adopt the idea of 
layering by decomposition in ifThl to coordinate the learning 
modules. One typical way of doing so is to pick the objective 
of one network functionality as the major goal and treat 
the goals of all the other functionalities as constraints. It is 
worth noting that such an operation can also be considered
as a way of integration. However, its ultimate goal is
to create an optimization structure which suits the further
operation of decomposing it into interrelated but layered
learning processes. A revisit to the work on layered Q-learning
for video compression [193] helps to exemplify such an idea
in detail. In [193], a multimedia processing system considers
three different concurrent objective functions, which are the 
video distortion at the codec level, the queueing delay for 
video frame processing in the pre-encoding buffer, and the 
energy cost in the OS/hardware layer. The distortion and 
queueing delay can be treated as two objective functions in 
the application layer of the system sharing the same system 
state, while the configuration that defines the energy cost 
(the operating frequency in this case) also determines the 
distortion of the compressed video. In [193], minimizing the
queueing delay is considered as the main objective, and the other
two objective functions are treated as constraints. Conflicts
between different functionalities can be easily found in this 
case, since increasing the operating frequency will lead to a 
better video quality but result in more energy consumption. By 
creating such a constrained optimization problem, a layered 
Q-learning mechanism is designed in a way that is similar to 
the procedure of dual decomposition. As briefly discussed in 
Section Iml a two-layer learning framework is created in the 
following way. In the application layer, the Q-learning module 
receives the signaling from the OS/hardware layer about its 
action (frequency selection) information, and learns the local 
state value. In the OS/hardware layer, the local learning 
process receives the estimated Q-value of the application layer 
as part of its instantaneous utility, and then learns its own 
state value. Unlike the hierarchical learning based integration
method, layered learning based on decomposition does not
require one learning process to finish before
another learning process starts.
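
The message exchange between the two layers can be sketched as follows. This is a simplified illustration under stated assumptions, not the layered Q-learning algorithm of [193]: the class `LayerLearner`, the function `layered_step`, the toy environment callback `env` and the way the energy cost enters the lower-layer utility are all hypothetical.

```python
import random
from collections import defaultdict

class LayerLearner:
    """Tabular epsilon-greedy Q-learner over (observation, action) pairs."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)
        self.rng = random.Random(seed)

    def best_value(self, obs):
        return max(self.q[(obs, a)] for a in self.actions)

    def act(self, obs):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(obs, a)])

    def update(self, obs, action, utility, next_obs):
        target = utility + self.gamma * self.best_value(next_obs)
        self.q[(obs, action)] += self.alpha * (target - self.q[(obs, action)])

def layered_step(app, os_layer, app_state, os_state, env):
    """One concurrent learning step of both layers (neither waits for the other)."""
    freq = os_layer.act(os_state)        # lower layer selects the frequency
    app_obs = (app_state, freq)          # frequency is signalled to the upper layer
    app_action = app.act(app_obs)
    # env is an assumed callback returning
    # (app_utility, energy_cost, next_app_state, next_os_state).
    app_utility, energy_cost, next_app_state, next_os_state = env(
        app_state, os_state, app_action, freq)
    # Simplification: the next observation reuses the current frequency.
    app.update(app_obs, app_action, app_utility, (next_app_state, freq))
    # The upper layer's Q estimate is part of the lower layer's utility.
    os_utility = app.q[(app_obs, app_action)] - energy_cost
    os_layer.update(os_state, freq, os_utility, next_os_state)
    return next_app_state, next_os_state
```

Both learners are updated in the same step, which reflects the point made above: unlike the hierarchical scheme, neither layer has to finish learning before the other one starts.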

Like integration-based learning, the mathematical proof
of convergence for decomposition-based learning is still
rarely discussed in the existing literature. Meanwhile,
although considered more autonomous than the
model-based coordination methods such as
trigger-condition-action, decomposition-based learning needs a pre-determined
constrained-optimization structure for layering of the
learning processes. Such a requirement may limit the ability of
decomposition-based learning to respond quickly to the
requests of a certain network functionality that cannot be
reached in the given constrained-optimization structure. From
this point of view, finding a satisfactory tradeoff between
different functionalities still remains an open question for
decomposition-based learning coordination.

VII. Conclusion 

Owing to the distributed nature of cognitive wireless
networks, model-free learning is especially appropriate for
the wireless nodes to adaptively choose their transmission
strategies in a self-organized manner without requiring much
knowledge of the network conditions. In this paper, we
have provided a comprehensive survey on the applications of
the state-of-the-art learning mechanisms in a wide range of
scenarios of network modeling. With a broad-scope analysis
and comparison of the literature, we have focused on learning
algorithms that can be categorized with a set of prototypical
schemes. Briefly, these prototypical schemes include
MDP-based learning and experience sharing, conjecture-based
learning, FP/GP-based learning, LA-based learning and no-regret
learning. We have classified the various scenarios for the
applications of learning into three major categories, namely,
the SAS-based network control, the loosely-coupled MAS-based
network control and the game-based network control.
We have mainly focused on the following characteristics of
the selected learning algorithms: (i) the ability of the learning
schemes to achieve optimality/equilibria without knowing an
a-priori model for the environment, (ii) the ability of the
learning schemes to achieve optimality/equilibria without
obtaining information that is not locally available and (iii) the
ability of the learning schemes to quickly adapt by exchanging
experience. In addition to detailed reviews of the existing
applications of learning in wireless networks, we have also
discussed a variety of open issues that need to be addressed in
future research. We hope this survey will serve as an important
guideline for future research directions to further understand
model-free learning mechanisms and expand their applications
in cognitive wireless networks.